Abstract

Hidden Markov Models are one of the most popular and successful techniques used 
in statistical pattern recognition. However, they are not well understood on a funda- 
mental level. For example, we do not know how to characterize the class of processes 
that can be well approximated by HMMs. This thesis tries to uncover the source 
of the intrinsic expressiveness of HMMs by studying when and why two models may 
represent the same stochastic process. Define two statistical models to be equiva- 
lent if they are models of exactly the same process. We use the theorems proved in 
this thesis to develop polynomial time algorithms to detect equivalence of prior dis- 
tributions on an HMM, equivalence of HMMs and equivalence of HMMs with fixed 
priors. We characterize Hidden Markov Models in terms of equivalence classes whose 
elements represent exactly the same processes and proceed to describe an algorithm 
to reduce HMMs to essentially unique and minimal, canonical representations. These 
canonical forms are essentially "smallest representatives" of their equivalence classes, 
and the number of parameters describing them can be considered a representation for 
the complexity of the stochastic process they model. On the way to developing our 
reduction algorithm, we define Generalized Markov Models which relax the positivity 
constraint on HMM parameters. This generalization is derived by taking the view 
that an interpretation of model parameters as probabilities is less important than a 
parsimonious representation of stochastic processes. 
Chapter 1



Introduction and Basic Definitions 



1.1 Overview 

Hidden Markov Models (HMMs) are one of the more popular and successful tech- 
niques for pattern recognition in use today. For example, experiments in speech recog- 
nition have shown that HMMs can be useful tools in modelling the variability of hu- 
man speech. ([Juang91],[lee88],[rabiner86],[bahl88]) Hidden Markov Models have also 
been used in computational linguistics [kupiec90], in document recognition [kopec91] 
and in such situations where intrinsic statistical variability in data must be accounted 
for in order to perform pattern recognition. HMMs are constructed by considering 
stochastic processes that are probabilistic functions of Markov Chains. The under- 
lying Markov Chain is never directly measured and hence the name Hidden Markov 
Model. 1 An example of an HMM could be the artificial economy of Figure 1.1. The 
economy in this figure transitions probabilistically between the states Depressed, Nor- 
mal, and Elevated. The average stock price in each of these states is a probabilistic 
function of the state. Typically, pattern recognition using Hidden Markov Models is 
carried out by building HMM source models for stochastic sequences of observations. 



1 Hidden Markov Models are also closely related to Probabilistic Automata. Appendix A discusses 
the connections in detail and shows that with appropriate definitions of equivalence, HMMs can be 
considered a subclass of probabilistic automata. 




[Figure 1-1 diagram not reproduced. Legend: states; transitions between states with probability x; outputs A emitted with probability y.]



This artificial economy is always found in one of three 
states: Depressed, Normal or Elevated. Given that it is 
in one of these states it tends to stay there. The average
daily price of stocks is a probabilistic function of the
state. For example, when the economy is Normal, the
average price of stocks is $10 with probability 0.6, and
is $5 with probability 0.15.



Figure 1-1: A Hidden Markov Model Economy 



A given sequence is classified as arising from the source whose HMM model has the 
highest a posteriori likelihood of producing it. Despite their popularity and relative 
success, HMMs are not well understood on a fundamental level. This thesis attempts 
to lay part of a foundation for a more principled use of Hidden Markov Models in 
pattern recognition. In the next section I will briefly describe the history of func- 
tions of Markov Chains as relevant to this thesis. I will then proceed to discuss the 
motivations underlying this research and the major questions that I address here. 
Principally, these questions will involve the development of fast algorithms for decid- 
ing the equivalence of HMMs and reducing them to minimal canonical forms. The 
chapter will conclude by introducing the basic definitions and notation necessary for 
understanding the rest of the thesis.

1.2 Historical Overview 

As mentioned in the previous section, Hidden Markov Models are probabilistic functions
of Markov Chains, of which the artificial economy in Figure 1.1 is an example.
The concept of a function of a Markov Chain is quite old and the questions answered
in this thesis seem to have been first posed by Blackwell and Koopmans in 1957 in the
context of related deterministic functions of Markov Chains. [blackwell57] This work 
sought to find necessary and sufficient conditions that would "identify" equivalent 
deterministic functions of Markov Chains, and studied the question in some special 
cases. Gilbert, in 1959, provided a more general, but still partial, answer to this 
question of "identifiability" of deterministic functions of Markov Chains. [gilbert59]
The topic was studied further by several authors who elucidated various aspects 
of the problem. ([burke58], [dharma63a], [dharma63b], [dharma68], [bodreau68], 
[rosenblatt71]) Functions of Markov Chains were also studied under the rubric "Grouped 
Markov Chains", and necessary and sufficient conditions were established for equiva- 
lence of a Grouped Chain to a traditional Markov Chain. ([kemeney65], [iosifescu80]) 
Interest in functions of Markov Chains, and particularly, probabilistic functions of 
Markov Chains, has been revived recently because of their successful applications in 
speech recognition. The most effective recognizers in use today employ a network 
of HMMs as their basic technology for identifying the words in a stream of spoken 
language. ([lee88],[levinson83]) Typically, the HMMs are used as probabilistic source 
models which are used to compute the posterior probabilities of a word, given a model. 
This thesis arises from an attempt to build part of a foundation for the principled use 
of HMMs in pattern recognition applications. We provide a complete characterization 
of equivalent HMMs and give an algorithm for reducing HMMs to minimal canon- 
ical representations. Some work on the subject of equivalent functions of Markov 
Chains has been done concurrently with this thesis in Japan. [ito92] However, Ito et
al. work with less general deterministic functions of Markov Chains, and find an 
algorithm for checking equivalent models that takes time exponential in the size of 
the chain. (In this thesis, we achieve polynomial time algorithms in the context of 
more general probabilistic functions of Markov Chains.) Some work has been done by 
Y. Kamp on the subject of reduction of states in HMMs. [kamp85] However, Kamp's
work only considers the very limited case of reducing pairs of states with identical 
output distributions, in left-to-right models. There has also been some recent work 
in the theory of Probabilistic Automata (PA) which uses methods similar to ours 
to study equivalence of PAs. [tzeng] Tzeng cites the work of Azaria Paz [paz71] and
others as achieving the previous best results for testing equivalence of Probabilistic 
Automata. 2 Appendix A will define Probabilistic Automata and discuss their con- 
nections with HMMs. In Chapter 3 we will define Generalized Markov Models, a new 
class of models for stochastic processes that are derived by relaxing the positivity 
constraint on some of the parameters of HMMs. The idea of defining GMMs arises 
from work by L. Niles, who studied the relationship between stochastic pattern classifiers
and "neural" network schemes. [niles90] Niles demonstrated that relaxing the
positivity constraint on HMM parameters had a beneficial effect on the performance 
of speech classifiers. He proceeded to interpret the negative weights as inhibitory 
connections in a network formulation of HMMs. 



1.3 The Major Questions 

Despite their popularity and relative success, HMMs are not well understood on
a theoretical level. If we wish to apply these models in a principled manner to 



2 Paz's results placed the problem of deciding equivalence of Probabilistic Automata in the com- 
plexity class co-NP. It is well known that equivalence of deterministic automata is in P and equiv- 
alence of nondeterministic automata is PSPACE-complete. Tzeng decides equivalence of PAs in 
polynomial time using methods similar to ours. 
Bayesian classification, we should know that HMMs are able to accurately represent
the class-conditional stochastic processes appropriate to the classification domain. 
Unfortunately, we do not understand in detail the class of processes that can be 
modelled exactly by Hidden Markov Models. Even worse, we do not know how many 
states an HMM would need in order to approximate a given stochastic process to 
a given degree of accuracy. We do not even have a good grasp of precisely what 
characteristics of a stochastic process are difficult to model using HMMs. 3 There is 
a wide body of empirical knowledge that practitioners of Hidden Markov Modelling 
have built up, but I feel that the collection of useful heuristics and rules of thumb 
they represent are not a good foundation for the principled use of HMMs in pattern 
recognition. This thesis arises from some investigations into the properties of HMMs 
that are important for their use as pattern recognizers. 

1.3.1 Intuitions and Directions 

The basic intuition underlying a comparison of the relative expressiveness of Hidden 
Markov Models and the well-understood Markov Chains suggests that HMMs should 
be more "powerful" since we can store information concerning the past in probability 
distributions that are induced over the hidden states. This stored information per- 
mits the output of a finite-state HMM to be conditioned on the entire past history 
of outputs. This is in contrast with a finite-state Markov Chain which can be condi- 
tioned only on a finite history. On the other hand, the amount of information about 
the output of an HMM at time t, given by the output at time (t — n), should drop off 
with n. It can also be seen that there are many HMMs that are models of exactly the 
same process, implying that there can be many redundant degrees of freedom in a 
Hidden Markov Model. This leads to the auxiliary problem of trying to characterize 
Hidden Markov Models in terms of equivalence classes that are models of precisely 



3 We can, however, reach some conclusions quickly by considering analogous questions for Finite 
Automata. For example, it should not be possible to build a finite-state HMM that accurately 
models the long-term statistics of a source that emits palindromes with high probability.
the same process. Such an endeavour would give some insight into the features of an
HMM that contribute to its expressiveness as a model for stochastic processes. Given 
a characterization in terms of equivalence classes, every HMM could be reduced to 
a canonical form which would essentially be the smallest member of its class. This 
is a prerequisite for the problem of characterizing the processes modelled by HMMs, 
since we should, at the very least, be able to say what makes one model different from 
another. Furthermore, the canonical representation of an HMM would presumably 
remove many of the superfluous features of the model that do not contribute to its in- 
trinsic expressiveness. Therefore, we could more easily understand the structure and 
properties of Hidden Markov Models by studying their canonical representations. In 
addition, a minimal representation for a stochastic process within the HMM frame- 
work is an abstract measure of the complexity of the process. This idea has some 
interesting connections with Minimum Description Length principles and ideas about 
Kolmogorov Complexity. However, these connections are not explored in this thesis. 

1.3.2 Contributions of This Thesis 

Keeping the goals described above in mind, I have developed quick methods to decide 
equivalence of Hidden Markov Models and reduce them to minimal canonical forms. 
On the way, I introduce a convenient generalization of Hidden Markov Models that re- 
laxes some of the constraints imposed on HMMs by their probabilistic interpretation. 
These Generalized Markov Models (GMMs), defined in Chapter 3, preserve the essen- 
tial properties of HMMs that make them convenient pattern classifiers. They arise 
from the point of view that having a probabilistic interpretation of HMM parameters 
is peripheral to the goal of designing convenient and parsimonious representations 
for stochastic processes. The reduction algorithm for Hidden Markov Models will, 
in fact, reduce HMMs to their minimal equivalent GMMs. Towards the end of the 
thesis, I will also briefly consider the problem of approximate equivalence of models. 
This is important because, in any practical situation, HMM parameters are estimated 
from data and are subject to statistical variability. I have listed the major results of
this thesis below. I have developed: 

1. A polynomial time algorithm to check equivalence of prior probability dis- 
tributions on a given model. 

2. A polynomial time algorithm to check equivalence of HMMs with fixed 
priors. 

3. A polynomial time algorithm to check the equivalence of HMMs for arbi- 
trary priors. 

4. A definition for a new type of classifier, a Generalized Markov Model 
(GMM), that is derived by relaxing the positivity constraint on HMM pa- 
rameters. We will give a detailed description of the relationship between 
HMMs and GMMs. 

5. A polynomial time algorithm to canonicalize a GMM by reducing it to a 
minimal equivalent model that is essentially unique. The minimal repre- 
sentation, when appropriately restricted, will be a minimal representation 
of HMMs in the GMM framework. The result will also involve a charac- 
terization of the essential degree of expressiveness of a GMM. 

We will see that all these results are easy to achieve when cast in the language of
linear vector spaces. The problems discussed here have remained open for quite a 
long time because they were not cast in the right language for easy solution. 

1.4 Basic Definitions 

In this section I will define Hidden Markov Models formally, and I will introduce the 
basic notation and concepts that will be useful in later chapters. 
1.4.1 Hidden Markov Models

Definition 1.1 (Hidden Markov Model) 

A Hidden Markov Model can be defined as a quadruple $\mathcal{M} = (S, O, A, B)$ where
$s_i \in S$ are the states of the model and $o_j \in O$ are the outputs. Taking $s(t)$ to be
the state and $o(t)$ the output of $\mathcal{M}$ at time $t$, we also define the transition matrix
$A$ and the output matrix $B$ so that $A_{ij} = \Pr(s(t) = s_i \mid s(t-1) = s_j)$ and
$B_{ij} = \Pr(o(t) = o_i \mid s(t) = s_j)$. In this thesis we only consider HMMs with discrete and finite
state and output sets. So, for future use we also let $n = |S|$ and $k = |O|$.

In order for an HMM to model a stochastic process, it must be initialized by 
specifying an initial probability distribution over states. The model then transitions 
probabilistically between its states based on the parameters of its transition matrix 
and emits symbols based on the probabilities in its output matrix. Therefore, we 
define an Initialized Hidden Markov Model as follows: 

Definition 1.2 (Initialized Hidden Markov Model) 

An Initialized Hidden Markov Model is a quintuple $\mathcal{M} = (S, O, A, B, p)$. The symbols
$S$, $O$, $A$, and $B$ represent the same quantities as they do in Definition 1.1. $p$ is a
probability vector such that $p_i$ is the probability that the model starts in state $s_i$ at
time $t = 0$. We take $p$ to be a column vector. Having fixed the priors, the model
may be evolved according to the probabilities encoded in the transition matrix $A$ and
the output matrix $B$. If $\mathcal{M}$ is a given Hidden Markov Model, we will use the notation
$\mathcal{M}(p)$ to denote the HMM $\mathcal{M}$ initialized by the prior $p$.

Figure 1.2 shows an example of a Hidden Markov Model as defined above. Our 
definition is slightly different from the standard definition of HMMs which actually 
corresponds to our Initialized Hidden Markov Models. In our formulation, an HMM 
defines a class of stochastic processes corresponding to different settings of the prior 
probabilities on the states. An Initialized Hidden Markov Model is a specific process
derived by fixing a prior on an HMM. 
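
As a concrete illustration of Definitions 1.1 and 1.2, the following Python sketch stores an HMM as a pair of column-stochastic matrices and an Initialized HMM as that pair together with a prior. The class names `HMM` and `InitializedHMM`, the helper `is_column_stochastic`, and all numerical values are assumptions made for illustration; none of them come from the thesis.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


def is_column_stochastic(m: Matrix, tol: float = 1e-9) -> bool:
    """Return True if all entries are non-negative and each column sums to 1."""
    rows, cols = len(m), len(m[0])
    for j in range(cols):
        column = [m[i][j] for i in range(rows)]
        if any(entry < 0.0 for entry in column):
            return False
        if abs(sum(column) - 1.0) > tol:
            return False
    return True


@dataclass
class HMM:
    """An uninitialized HMM (Definition 1.1) with n states and k outputs."""
    A: Matrix  # n x n, A[i][j] = Pr(s(t) = s_i | s(t-1) = s_j)
    B: Matrix  # k x n, B[i][j] = Pr(o(t) = o_i | s(t) = s_j)

    def __post_init__(self) -> None:
        n = len(self.A)
        if any(len(row) != n for row in self.A):
            raise ValueError("A must be square (n x n)")
        if any(len(row) != n for row in self.B):
            raise ValueError("B must have one column per state (k x n)")
        if not (is_column_stochastic(self.A) and is_column_stochastic(self.B)):
            raise ValueError("columns of A and B must be probability vectors")


@dataclass
class InitializedHMM:
    """An HMM together with a fixed prior p over its states (Definition 1.2)."""
    model: HMM
    p: List[float]  # p[i] = probability of starting in state s_i

    def __post_init__(self) -> None:
        if len(self.p) != len(self.model.A):
            raise ValueError("prior must have one entry per state")
        if not is_column_stochastic([[pi] for pi in self.p]):
            raise ValueError("prior must be a probability vector")


# Hypothetical two-state, two-output model used only for illustration.
example = HMM(A=[[0.9, 0.2], [0.1, 0.8]], B=[[0.7, 0.1], [0.3, 0.9]])
example_uniform = InitializedHMM(example, [0.5, 0.5])
example_skewed = InitializedHMM(example, [1.0, 0.0])
```

The two initialized models share one `HMM` object, which mirrors the distinction drawn above: the HMM stands for a class of processes, and each choice of prior selects one member of that class.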
O = { a, b, c }



Figure 1-2: A Hidden Markov Model 

1.4.2 Variations on The Theme 

It should be pointed out that many variants of Hidden Markov Models appear in
the literature. Authors have frequently used models in which the outputs are as- 
sociated with the transitions rather than the states. It can be shown quite easily 
that it is always possible to convert such a model into an equivalent HMM accord- 
ing to our definition. 4 However, for somewhat technical reasons, converting from a 
hidden-transition HMM to a hidden-state HMM requires, in general, an increase in 
the number of states. The literature also frequently uses models with continuously 
varying observables. These are easily defined by replacing the "output matrix" B by
continuous output densities. HMMs with Gaussian output densities are related to 
the Radial Basis Functions of [poggio89]. 5 Some authors also designate "absorbing 
states" which, when entered, cause the model to terminate production of a string. 



4 This is analogous to the equivalence of Moore and Mealy Finite State Machines.
5 Suppose $\mathcal{M}$ is a Hidden Markov Model with states $S = \{s_1, s_2, \ldots, s_n\}$ and Gaussian output
distributions $\{G_{s_1}, G_{s_2}, \ldots, G_{s_n}\}$ associated with the states. Also let $x = (o(1), o(2), \ldots, o(t))$ be an
The analysis of such absorbing models is somewhat different from that of the HMMs
in Definition 1.1 for uninteresting technical reasons. For the substantive problems of 
pattern recognition an absorbing model can always be "simulated" in our formula- 
tion by creating a state which emits a single special output symbol and loops with 
probability 1 onto itself. 

1.4.3 Induced Probability Distributions 

As described in the definition of Initialized HMMs, a stochastic process can be mod- 
elled using Hidden Markov techniques by constructing an appropriate HMM, initial- 
izing it by specifying a prior probability distribution on the states, and then evolving 
the model according to its parameters. This evolution then produces output strings 
whose statistics define a stochastic process over the output set of the model. In 
recognition applications we are usually interested in the probability that a given ob- 
servation string was produced by a source whose model is a given HMM. We quantify 
this by defining the probability distribution over strings induced by an Initialized 
Hidden Markov Model: 

Definition 1.3 (Induced Probability Distributions) 

Suppose we are given an HMM $\mathcal{M} = (S, O, A, B)$ and a prior distribution $p$. Borrow
the standard notation of the theory of regular languages, and let $O^*$ denote the set of
all finite length strings that can be formed by concatenating symbols in $O$ together.

We then define the probability that a given string $x \in O^*$ is produced by $\mathcal{M}(p)$ as



output string of length $t$. Then we can use Equation 1.1 to write:

$$\Pr(x \mid \mathcal{M}, p) = \sum_{s(1), \ldots, s(t)} \Pr(s(1), \ldots, s(t) \mid \mathcal{M}, p) \, \Pr(x \mid s(1), \ldots, s(t))$$

$$= \sum_{s(1), \ldots, s(t)} \Pr(s(1), \ldots, s(t) \mid \mathcal{M}, p) \, G_{s(1)}[o(1)] \cdots G_{s(t)}[o(t)]$$

Each of the products of Gaussians in the second equation defines a "center" for a Radial Basis
Function. The sum over states then evaluates a weighted sum over the activations of the various
"centers" which are produced as appropriate permutations of the $G_{s_i}$.
follows. Let $m = |x|$ and let $s_1, s_2, \ldots, s_m \in S$. Then:

$$\Pr(x \mid \mathcal{M}, p) = \Pr(x \mid \mathcal{M}(p), |x|) = \sum_{s_1, s_2, \ldots, s_m} \Pr(s_1, \ldots, s_m \mid \mathcal{M}(p)) \, \Pr(x \mid s_1, \ldots, s_m) \qquad (1.1)$$



Essentially, given a model $\mathcal{M}$, the probability of a string $x$ of length $m$ is the likelihood
that the model will emit the string $x$ while traversing any length $m$ sequence of states.
Because the definition conditions the probability on the length of the string, $\Pr(x \mid \mathcal{M}, p)$
defines a probability distribution over strings $x$ of length $m$ for each positive integer
$m$. We let $\epsilon$ represent the null string and set $\Pr(\epsilon \mid \mathcal{M}, p) = 1$.
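
Equation 1.1 can be evaluated literally by enumerating every state sequence of length $m$. The Python sketch below does exactly that; it is meant only to make the definition concrete, since its cost grows as $n^m$. The function name `string_probability_bruteforce` and the convention that states and outputs are 0-based integer indices are assumptions of this sketch.

```python
from itertools import product
from typing import List, Sequence

Matrix = List[List[float]]


def string_probability_bruteforce(
    A: Matrix, B: Matrix, p: Sequence[float], x: Sequence[int]
) -> float:
    """Evaluate Pr(x | M, p) by summing over all state sequences (Equation 1.1).

    A[i][j] = Pr(s(t) = s_i | s(t-1) = s_j), B[i][j] = Pr(o(t) = o_i | s(t) = s_j),
    p[i] = probability of starting in state s_i, x = sequence of output indices.
    """
    if not x:
        return 1.0  # the null string has probability 1 by convention
    n = len(p)
    total = 0.0
    for states in product(range(n), repeat=len(x)):
        # Pr(s_1, ..., s_m | M(p)): prior on s_1 times the chain of transitions.
        path_prob = p[states[0]]
        for prev, cur in zip(states, states[1:]):
            path_prob *= A[cur][prev]
        # Pr(x | s_1, ..., s_m): each output depends only on the current state.
        emit_prob = 1.0
        for state, output in zip(states, x):
            emit_prob *= B[output][state]
        total += path_prob * emit_prob
    return total
```

Summing the returned value over all strings of a fixed length gives 1 (up to rounding), which is the sense in which the definition yields one probability distribution per string length.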

The probability distributions defined above specify the statistical properties of the
stochastic process for which the HMM initialized by p is a source model. Typical 
pattern recognition applications evaluate this "posterior probability" of an observa- 
tion sequence given each of a collection of models and classify according to the model 
with the highest likelihood. 

So an HMM defines a class of stochastic processes - each process corresponding 
to a different choice of initial distribution on the states. This immediately raises the 
question of testing whether two prior distributions on a given model induce identical 
processes. In Chapter 2 we will see that there is an efficient algorithm for deciding 
this question. But first, in the next section, we will introduce some notation and 
techniques that show how to use the basic definitions to calculate the quantities of 
interest to us. 

1.5 How To Calculate With HMMs

The basic quantity we are interested in calculating is the probability a given string will 
be produced by a given model. We will see later that for purposes of determining the 
equivalence of models and reducing them to canonical forms it is also useful to compute
various probability distributions over the states and the outputs. In this section we 
will introduce some notation that will enable us to mechanize the computation of these
quantities so that later analysis becomes easy. The notation and details may become 
tedious and confusing and so the reader may wish to skim the section, referring back 
to it as necessary. 

Definition 1.4 (State and Output Distributions) 

Let $\mathcal{M} = (S, O, A, B)$ be an HMM with a prior $p$, $n$ states and $k$ outputs. Let $s(t)$
and $o(t)$ be, respectively, the state and output at time $t$. Let $k(t)$ be an $n$-dimensional
column vector such that $k_i(t) = \Pr(s(t) = s_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, p)$. In other
words, $k_i(t)$ is the joint probability of being in $s_i$ at time $t$ and seeing all the previous
outputs. We define $m_i(t)$ to be the probability of being in state $s_i$ after also seeing
the output at time $t$: $m_i(t) = \Pr(s(t) = s_i, o(1), \ldots, o(t-1), o(t) \mid \mathcal{M}, p)$. Finally, let
$l(t)$ be a column vector describing the probabilities of the various outputs at time $t$:
$l_i(t) = \Pr(o(t) = o_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, p)$. From the definition of the $B$ matrix,
we can write this as: $l(t) = B k(t)$.

In order to determine equivalence of HMMs and reduce them to canonical forms 
we will need to be able to reason conveniently about the temporal evolution of the 
model. Using Definition 1.4 we can write that $k(t+1) = A m(t)$. Furthermore, if
$o(t) = o_j$ we can factor the definition of $m(t)$ to write:

$$m_i(t) = \Pr(o(t) = o_j \mid s(t) = s_i, \mathcal{M}, p, o(1), \ldots, o(t-1)) \, \Pr(s(t) = s_i, o(1), \ldots, o(t-1) \mid \mathcal{M}, p)$$

$$= \Pr(o(t) = o_j \mid s(t) = s_i) \, k_i(t)$$

$$= B_{ji} k_i(t) \qquad (1.2)$$

In order to write Equation 1.2 more compactly, we introduce the following notion of 
a projection operator: 

Definition 1.5 (Projection Operators) 

Suppose an HMM $\mathcal{M} = (S, O, A, B)$ has $k$ outputs. We define a set of projection
operators $\{B_1, B_2, \ldots, B_k\}$ so that $B_i = \mathrm{Diag}[i\text{th row of } B]$. In other words $B_i$ is a
diagonal matrix whose diagonal elements are the row in $B$ corresponding to the output
symbol $o_i$. Sometimes we will use the notation $B_o$ to mean the projection operator
corresponding to the output $o$. (i.e. $B_o = \sum_{o_i \in O} \delta(o, o_i) B_i$ where $\delta(a, b)$ is 1 if
$a = b$ and 0 otherwise.) Suppose $v$ is a vector whose dimension equals the number
of states of the model. Then multiplying $v$ by $B_i$ weights each component of $v$ by the
probability that the corresponding state would emit the output $o_i$.

We can use the projection operator notation to compactly write Equation 1.2 as
$m(t) = B_{o(t)} k(t)$. Now we can write $k(t+1) = A B_{o(t)} k(t)$ and $m(t+1) = B_{o(t+1)} A m(t)$.
In order to summarize this we introduce a set of definitions for the transition operators 
of a Hidden Markov Model. 

Definition 1.6 (Transition Operators) 

Given an HMM $\mathcal{M} = (S, O, A, B)$ with $n$ states we define the model transition
operators as follows. Let $\epsilon$ be the null string. Define $T(\epsilon) = I$ where $I$ is the $n \times n$
identity matrix. Also, for every $o_k \in O$ define $T(o_k) = A B_{o_k}$. We can see that $T(o_k)_{ij}$
is the probability of emitting $o_k$ in state $s_j$ and then entering $s_i$. We extend these to
be transition operators on $O^*$ as follows. For any output string $x = (o_1, o_2, \ldots, o_t) \in O^*$
let:

$$T(x) = T(o_1, \ldots, o_t) = T(o_t) T(o_{t-1}) \cdots T(o_1) \qquad (1.3)$$

We can interpret these extended transition operators by noticing that $T(x)_{ij}$ is the
probability of starting in state $s_j$, emitting the string $x$, and then entering state $s_i$.
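
The projection and transition operators of Definitions 1.5 and 1.6 translate directly into small matrix routines. The following pure-Python sketch is one way to write them; the function names (`identity`, `matmul`, `projection`, `transition_operator`) and the use of 0-based output indices are assumptions of this sketch.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    """The n x n identity matrix, i.e. T(null string)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Ordinary matrix product X Y."""
    inner, cols = len(Y), len(Y[0])
    return [
        [sum(X[i][r] * Y[r][j] for r in range(inner)) for j in range(cols)]
        for i in range(len(X))
    ]


def projection(B: Matrix, o: int) -> Matrix:
    """B_o = Diag[row of B for output o] (Definition 1.5)."""
    n = len(B[o])
    return [[B[o][i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_operator(A: Matrix, B: Matrix, x: Sequence[int]) -> Matrix:
    """T(x) = T(o_t) ... T(o_1) with T(o) = A B_o (Definition 1.6)."""
    T = identity(len(A))
    for o in x:
        # Each new symbol multiplies on the left, so the first symbol acts first.
        T = matmul(matmul(A, projection(B, o)), T)
    return T
```

Because the columns of $B$ sum to 1, the projection operators sum to the identity, so summing `transition_operator(A, B, [o])` over all outputs `o` recovers `A`. This gives a quick sanity check on any implementation.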

Using the transition operators of Definition 1.6 we can conveniently write all the quantities
we wish to compute. Suppose $\mathcal{M}$ is an HMM with $n$ states, $k$ outputs and prior
$p$. Take $x_t$ to be the output string $(o_1, o_2, \ldots, o_t)$ and $\mathbf{1}$ to be an $n$-dimensional vector
all of whose entries are 1. Also let $x_{t-1}$ be the $t-1$ long prefix of $x_t$. Then we can
see that:

$$m(t) = B_{o_t} T(x_{t-1}) p \qquad (1.4)$$

$$k(t+1) = A m(t) = T(x_t) p \qquad (1.5)$$

$$l(t+1) = B k(t+1) = B T(x_t) p \qquad (1.6)$$

$$\Pr(x_t \mid \mathcal{M}, p) = \mathbf{1} \cdot (T(x_t) p) \qquad (1.7)$$

The reader may wish to verify some of these equations from the definitions to ensure 
his or her facility with the notation. 
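
One way to check Equations 1.4 to 1.7 numerically is to apply the operators to the prior one symbol at a time, which costs on the order of $t n^2$ operations instead of enumerating all $n^t$ state sequences. The sketch below is a minimal Python version under the assumption that outputs are 0-based integer indices; the function names `string_probability` and `next_output_distribution` are illustrative, not taken from the thesis.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def evolve(A: Matrix, B: Matrix, p: Sequence[float], x: Sequence[int]) -> List[float]:
    """Return k(t+1) = T(x_t) p by alternating Equations 1.4 and 1.5."""
    n = len(p)
    k_vec = list(p)  # T(null string) p = p
    for o in x:
        # m(t) = B_{o(t)} k(t): weight each state by its chance of emitting o.
        m_vec = [B[o][j] * k_vec[j] for j in range(n)]
        # k(t+1) = A m(t): propagate one step through the transition matrix.
        k_vec = [sum(A[i][j] * m_vec[j] for j in range(n)) for i in range(n)]
    return k_vec


def string_probability(A: Matrix, B: Matrix, p: Sequence[float], x: Sequence[int]) -> float:
    """Pr(x | M, p) = 1 . (T(x) p)  (Equation 1.7)."""
    return sum(evolve(A, B, p, x))


def next_output_distribution(
    A: Matrix, B: Matrix, p: Sequence[float], x: Sequence[int]
) -> List[float]:
    """l(t+1) = B T(x_t) p  (Equation 1.6): joint probability of x and each next output."""
    k_vec = evolve(A, B, p, x)
    return [sum(row[j] * k_vec[j] for j in range(len(k_vec))) for row in B]
```

For any string `x`, the entries of `next_output_distribution(A, B, p, x)` sum to `string_probability(A, B, p, x)`, since the next output must be one of the $k$ symbols.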

1.6 Roadmap 

This chapter has developed the background necessary for understanding the results 
in this thesis. The basic definitions and notation given here are summarized in Ta- 
ble 1.1. Chapter 2 discusses the algorithms related to equivalence of Hidden Markov 
Models. Chapter 3 defines Generalized Markov Models and describes the algorithm 
for reducing HMMs to minimal canonical forms. Chapter 3 also contains a funda- 
mental characterization of the essential expressiveness of a Hidden Markov Model. 
Chapter 4 presents some preliminary ideas concerning several topics including ap- 
proximate equivalence and potential practical applications of the results of this thesis. 
Finally, Appendix A shows how HMMs, in the formulation of this paper, are related 
to Probabilistic Automata. 
Given: an HMM $\mathcal{M} = (S, O, A, B)$ with $n$ states, $k$ outputs and prior $p$.

Definitions:

1. $\Pr(x \mid \mathcal{M}, p) = \Pr(x \mid \mathcal{M}(p), |x|) = \sum_{s_1, s_2, \ldots, s_m} \Pr(s_1, \ldots, s_m \mid \mathcal{M}(p)) \, \Pr(x \mid s_1, \ldots, s_m)$

2. $\Pr(\epsilon \mid \mathcal{M}, p) = 1$ where $\epsilon$ is the null string

3. $k(t)$ is an $n$-dimensional vector such that
$k_i(t) = \Pr(s(t) = s_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, p)$

4. $m(t)$ is an $n$-dimensional vector such that
$m_i(t) = \Pr(s(t) = s_i, o(1), \ldots, o(t-1), o(t) \mid \mathcal{M}, p)$

5. $l(t)$ is a $k$-dimensional vector such that
$l_i(t) = \Pr(o(t) = o_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, p)$

6. The projection operators $\{B_1, B_2, \ldots, B_k\}$
are defined as $B_i = \mathrm{Diag}[i\text{th row of } B]$. Also
if $o \in O$ then we write $B_o$ to denote the projection
operator corresponding to output $o$.

7. We define transition operators so that:

$T(\epsilon) = I$

$T(o_k) = A B_{o_k}$

$T(o(1), o(2), \ldots, o(t)) = T(o(t)) \cdots T(o(2)) T(o(1))$

Model Evolution:

1. Suppose the HMM emits the output $x_t = [o(1), o(2), \ldots, o(t)]$.
Also use the notation $x_{t-1}$ to mean the $t-1$ long prefix of $x_t$, and the
symbol $\mathbf{1}$ to mean the $n$-dimensional vector all of
whose entries are 1. Then we can write:

• $m(t) = B_{o(t)} T(x_{t-1}) p$

• $k(t+1) = A m(t) = T(x_t) p$

• $l(t+1) = B k(t+1) = B T(x_t) p$

• $\Pr(x_t \mid \mathcal{M}, p) = \mathbf{1} \cdot (T(x_t) p)$



Table 1.1: Summary of Important Notations 
Chapter 2



Equivalence of HMMs 



As discussed in the previous chapter, many different Hidden Markov Models can rep- 
resent the same stochastic process. Prior to addressing questions about the expressive 
power of HMMs, it is important to understand exactly when two models $\mathcal{M}$ and $\mathcal{N}$ are
equivalent in the sense that they represent the same statistics. In Section 2.2 we will 
see how to determine when two prior distributions on a given HMM induce identical 
stochastic processes. Section 2.3 discusses equivalence of Initialized Hidden Markov 
Models. Section 2.4 shows how to determine whether two HMMs are representations 
for the same class of stochastic processes. This will lead, in the next chapter, to 
a fundamental characterization of the degree of freedom available in a given model. 
This characterization will be used to reduce HMMs to minimal canonical forms. 

2.1 Definitions 

We begin by defining what we mean by equivalence of Hidden Markov Models. First 
of all, we should say what it means for two stochastic processes to be equivalent. 

Definition 2.1 (Equivalence of Stochastic Processes) 

Suppose $\mathcal{X}$ and $\mathcal{Y}$ are two stochastic processes on the same discrete alphabet $O$. For
each $x \in O^*$ let $\Pr_{\mathcal{X}}(x)$ be the probability that after $|x|$ steps the process $\mathcal{X}$ has emitted
the string $x$. Define $\Pr_{\mathcal{Y}}(x)$ similarly. Then we say that $\mathcal{X}$ and $\mathcal{Y}$ are equivalent
processes ($\mathcal{X} \Leftrightarrow \mathcal{Y}$) if and only if $\Pr_{\mathcal{X}}(x) = \Pr_{\mathcal{Y}}(x)$ for every $x \in O^*$.

In Chapter 1 we discussed the interpretation of an Initialized Hidden Markov Model 
(IHMM) as a finite-state representation for a stochastic process, and we defined the 
probability distribution over strings induced by the process. We can use these defini- 
tions to say what we mean by equivalence of Initialized HMMs. 

Definition 2.2 (Equivalence of Initialized HMMs) 

Let $\mathcal{M}$ and $\mathcal{N}$ be two Hidden Markov Models with the same output set, and initialized
by priors $p$ and $q$ respectively. We wish to say that these initialized models are equivalent
if they represent the same stochastic process. So we say that $\mathcal{M}(p)$ is equivalent
to $\mathcal{N}(q)$ ($\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$) if and only if $\Pr(x \mid \mathcal{M}, p) = \Pr(x \mid \mathcal{N}, q)$ for every $x \in O^*$.
This is the same as saying that $\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$ exactly when, for every time $t$, the joint
probability of the output with the entire previous output sequence is the same for both
models. In the notation of Chapter 1 we can write this as: $B_{\mathcal{M}} T_{\mathcal{M}}(x) p = B_{\mathcal{N}} T_{\mathcal{N}}(x) q$
for every $x \in O^* \cup \{\epsilon\}$.
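
To make the condition of Definition 2.2 concrete, the Python sketch below compares the string probabilities of two initialized models for every string up to a chosen length. This is a brute-force illustration only: it is exponential in the length bound, it is not the polynomial time algorithm developed in this thesis, and agreement up to a bound does not by itself establish equivalence. The function names, the `(A, B)` tuple representation, and the tolerance `tol` are assumptions of this sketch.

```python
from itertools import product
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Model = Tuple[Matrix, Matrix]  # (A, B) with the conventions of Definition 1.1


def string_probability(A: Matrix, B: Matrix, p: Sequence[float], x: Sequence[int]) -> float:
    """Pr(x | M, p) = 1 . (T(x) p), computed one symbol at a time."""
    n = len(p)
    k_vec = list(p)
    for o in x:
        m_vec = [B[o][j] * k_vec[j] for j in range(n)]
        k_vec = [sum(A[i][j] * m_vec[j] for j in range(n)) for i in range(n)]
    return sum(k_vec)


def agree_up_to_length(
    model_m: Model,
    p: Sequence[float],
    model_n: Model,
    q: Sequence[float],
    max_len: int,
    tol: float = 1e-12,
) -> bool:
    """Check Pr(x | M, p) == Pr(x | N, q) for every string x with |x| <= max_len."""
    (A_m, B_m), (A_n, B_n) = model_m, model_n
    if len(B_m) != len(B_n):
        raise ValueError("models must share the same output set")
    num_outputs = len(B_m)
    for length in range(max_len + 1):
        for x in product(range(num_outputs), repeat=length):
            diff = string_probability(A_m, B_m, p, x) - string_probability(A_n, B_n, q, x)
            if abs(diff) > tol:
                return False  # x is a witness that the processes differ
    return True
```

A simple case where the check succeeds is a model and a copy of it with the states relabelled: permuting the rows and columns of $A$, the columns of $B$, and the entries of the prior in the same way changes the parameters but not the process.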

In Chapter 1 we also mentioned that different prior distributions on the same HMM
could induce the same stochastic process. In order to identify the conditions under 
which this can occur we make the following definition. 

Definition 2.3 (Equivalence of Prior Distributions) 

Let $p$ and $q$ be two different prior distributions on an HMM $\mathcal{M} = (S, O, A, B)$. We
say that $p$ and $q$ are equivalent priors for $\mathcal{M}$ ($p \equiv q$) if and only if $\mathcal{M}(p) \Leftrightarrow \mathcal{M}(q)$, i.e.,
if and only if the Initialized HMMs derived by fixing the priors on $\mathcal{M}$ are equivalent.

We are now ready to define equivalence of Hidden Markov Models. As discussed in
Chapter 1, HMMs can be treated as finite state representations for classes of stochastic
processes. We would like to say that two HMMs are equivalent if they represent the
same class of processes.

Definition 2.4 (Equivalence and Subset Relations for HMMs)

Let $\mathcal{M}$ and $\mathcal{N}$ be two HMMs with the same output set. Let $p$ and $q$ denote prior
distributions on $\mathcal{M}$ and $\mathcal{N}$ respectively. We say that $\mathcal{N}$ is a subset of $\mathcal{M}$ ($\mathcal{N} \subseteq \mathcal{M}$)
if and only if for each $q$ on $\mathcal{N}$ we can find a corresponding $p$ on $\mathcal{M}$ such that
$\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$. In other words, $\mathcal{N}$ is a subset of $\mathcal{M}$ if and only if the class of
processes represented by $\mathcal{N}$ is a subset of the class of processes represented by $\mathcal{M}$. We
can then write that $\mathcal{M}$ is equivalent to $\mathcal{N}$ ($\mathcal{M} \Leftrightarrow \mathcal{N}$) exactly when $\mathcal{N} \subseteq \mathcal{M}$ and $\mathcal{M} \subseteq \mathcal{N}$.

The basic intuition underlying all the results concerning the equivalence of HMMs 
is the following: The output distributions of an HMM are linear transformations 
that map an underlying dynamics on the states onto a dynamics on the space of 
observations. Heuristically, it must be the case that the components of the dynamics 
on the states that fall in the null-space of the output matrix must represent degrees of 
freedom that are irrelevant to the statistics on the outputs. So, for example, we will 
see that two prior distributions on a model are equivalent if and only if their difference 
falls in a particular subspace of the null-space of the output matrix. All the algorithms
discussed in this chapter will achieve their goals by rapidly checking properties of 
various vector spaces associated with HMMs. 



2.2 Equivalence of Priors 

When do two prior distributions on a given model induce the same stochastic process? 
This is the most basic question that we would like to answer. Using the notation 
developed in Chapter 1, and the definition of equivalent Initialized HMMs, we can 
write the condition for equivalent priors as follows: $p \equiv q$ if and only if $B T(x) p = B T(x) q$
for every $x \in O^* \cup \{\epsilon\}$. Let $\delta = p - q$. Then we can rephrase this as:
$B T(x) [p - q] = B T(x) \delta = 0$ for every $x \in O^* \cup \{\epsilon\}$. In other words, $p \equiv q$ if and
only if for every string $x \in O^* \cup \{\epsilon\}$ the vector $T(x)\delta$ falls in
the null-space of the output matrix $B$. This can be expressed in more geometrical
terms as follows.

Theorem 2.1 Equivalence of Priors (Geometrical Interpretation) 1
Suppose $\mathcal{M} = (S, O, A, B)$ is a Hidden Markov Model with $n$ states and $k$ outputs.
Let $p$ and $q$ be two prior distributions on $\mathcal{M}$ with $\delta = p - q$. Let $\mathcal{N}$ denote the
null-space of the linear transformation $B$ and let $\mathcal{I}$ be the largest subspace of $\mathcal{N}$ that is
invariant under each of the transition operators $T(o_i)$. Then $p \equiv q$ if and only
if $\delta \in \mathcal{I}$.

Proof: First of all suppose $\delta \in \mathcal{I} \subseteq \mathcal{N}$. Then because $\mathcal{I}$ is invariant under all the
$T(o_i)$ we know that $T(o_i)\delta \in \mathcal{I}$ and, by induction, we can say that for every
$x = [o(1), o(2), \ldots, o(t)] \in O^*$ it is true that $T(x)\delta = T(o(t)) \cdots T(o(1))\delta \in \mathcal{I}$. We conclude
that $T(x)\delta \in \mathcal{N}$ for every $x \in O^* \cup \{\epsilon\}$. Therefore, by our earlier discussion, $p$ is
equivalent to $q$. This proves the sufficiency of our condition for equivalence. Next we prove
necessity. Suppose that $p \equiv q$. Then let $\mathcal{D} = \{\delta(x) : \delta(x) = T(x)\delta,\ x \in O^* \cup \{\epsilon\}\}$
be the set of all differences between $T(x)p$ and $T(x)q$ for every string $x$. If $\delta(x)$ is
any vector in $\mathcal{D}$ and $T(o_i)$ is any transition operator, then $T(o_i)\delta(x)$ is also in $\mathcal{D}$.
So $\mathcal{D}$ is invariant under the action of every transition operator and, therefore,
so is $\mathrm{Span}(\mathcal{D})$. By assumption of equivalence of priors, every vector in $\mathcal{D}$ lies in the
null-space of $B$. So $\mathrm{Span}(\mathcal{D}) \subseteq \mathcal{N}$. We conclude that $\mathrm{Span}(\mathcal{D})$ is a subspace of $\mathcal{I}$, the
largest subspace of $\mathcal{N}$ that is invariant under all the transition operators. Since
$\delta = \delta(\epsilon)$ is itself an element of $\mathcal{D}$, it follows that $\delta \in \mathcal{I}$. This proves
the necessity of our condition for equivalence. □

In effect, the difference between equivalent priors is a vector that lies in a subspace
that contributes nothing to the probability distribution over outputs, and remains in
this subspace as the model evolves. It is not enough that $\delta$ simply be in the null-space
of $B$ because some of the vectors in the null-space may contribute to the dynamics
of the system and affect later distributions over outputs. The fact that $\delta$ lies in an
invariant subspace of the null-space guarantees that $\delta$ will never contribute to the
distribution over outputs, even after the model evolves. Figure 2.1 shows a simple
example in which all the states have the same output distribution, so that the
null-space of $B$ consists of all vectors that sum to zero. Furthermore, for every $i$, $B_i$ is
proportional to the identity so that $T(o_i)$ is proportional to $A$. Since $A$ is stochastic
it preserves sums and so we see that the space of vectors which sum to zero is an
invariant subspace of every $T(o_i)$. For any priors $p$ and $q$ we know that $\delta = p - q$ sums
to zero since $p$ and $q$ are both stochastic. So, as we would expect for this degenerate
case, Theorem 2.1 tells us that all prior distributions on the model induce equivalent
stochastic processes.

1 We remind the reader of the following linear algebraic notions. The null-space of a linear
transformation $B$ from $R^n$ to $R^k$ is the subspace of $R^n$ that is mapped by $B$ into the $k$-dimensional
zero vector. An invariant subspace of a linear transformation $T$ from $R^n$ to $R^n$ is a subspace $V$
such that $T$ maps every vector in $V$ into $V$.

Although Theorem 2.1 gives a good understanding of why two priors may be 
equivalent for a model, it is not in a form that is immediately useful for developing a 
quick algorithm. So we prove another form of the theorem that will be used directly 
in the algorithm of Figure 2.2.

Theorem 2.2 Equivalence of Priors
Let $\mathcal{M} = (S, O, A, B)$ be a Hidden Markov Model. Suppose $p$ and $q$ are two prior
distributions on $\mathcal{M}$ with $\delta = p - q$. Define $\mathcal{D} = \{\delta(x) : \delta(x) = T(x)\delta,\ x \in O^* \cup \{\epsilon\}\}$,
and let $V$ be any collection of vectors in $\mathcal{D}$ that forms a basis for the vector space
spanned by the elements of $\mathcal{D}$. Then $p \equiv q$ if and only if every vector in $V$ lies in the
null-space of $B$.

Proof: First suppose that $p \equiv q$. Then $V \subseteq \mathcal{D}$ and so, from the previous discussion,
every vector in $V$ must fall in the null-space of $B$, proving the necessity of the
theorem. Now suppose that $B v_j = 0$ for every vector $v_j \in V$. Then, since $V$ is a basis for
the span of $\mathcal{D}$, for every $\delta_i \in \mathcal{D}$ there exists a collection of coefficients $\{c_{ij}\}$ such that
$\delta_i = \sum_{j=1}^{|V|} c_{ij} v_j$. So, for every $\delta_i$ we can write
$B\delta_i = B \sum_{j=1}^{|V|} c_{ij} v_j = \sum_{j=1}^{|V|} c_{ij} (B v_j) = 0$.
This is the same as saying that $B T(x)\delta = 0$ for every $x \in O^* \cup \{\epsilon\}$. Consequently,
we have the desired result that $p \equiv q$. □

$\mathcal{M} = (S, O, A, B)$, $|S| = 3$, $|O| = 2$. $p$ and $q$ are prior distributions on the states of $\mathcal{M}$.
This probability simplex shows the set of valid prior distributions on the three
states of model $\mathcal{M}$. The dotted arrow shows the difference between two
priors. See text for discussion.

Figure 2-1: Geometrical Interpretation of Equivalence of Priors

Theorem 2.2 provides a necessary and sufficient condition for equivalence of priors
on a Hidden Markov Model. We can use it to construct an algorithm by quickly
generating the basis $V$ of the theorem and checking that the elements of the basis fall
in the null-space of $B$. The algorithm in Figure 2.2 does exactly this. 2 We will now
argue that the algorithm is correct and proceed to calculate its running time.

2 Our procedure for checking equivalence of priors can be optimized in various ways. One such
optimization will be presented in the analysis of the running time of the algorithm. We present the
algorithm of Figure 2.2 because it is easier to explain.



Given: An HMM $\mathcal{M} = (S, O, A, B)$
       where $|S| = n$, $|O| = k$
       and priors $p$ and $q$ on $\mathcal{M}$, with $\delta = p - q$

 1. V = {}
 2. Queue = {$\delta$}
    # Step 1: Find a Basis
 3. Until (|Queue| = 0) or (|V| = n) do
 4.     Let $f$ = first element in Queue
 5.     Remove $f$ from Queue
 6.     If $f \notin \mathrm{Span}(V)$ Then
 7.         Add $f$ to V
 8.         For each $o_i \in O$ do
 9.             Add $T(o_i) f$ to Queue
    # Step 2: Test the basis
10. For each $v \in V$ do
11.     If $Bv \neq 0$ Then Return(NOT-EQUIVALENT)
12. Return(EQUIVALENT)

Figure 2-2: Algorithm for Detecting Equivalence of Priors
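
The procedure of Figure 2-2 can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: it assumes $T(o_i) = A\,\mathrm{diag}(b_i)$ with $b_i$ the $i$-th row of $B$ and a column-stochastic $A$, it uses exact rational arithmetic so that the span test in line 6 is unambiguous, and every function name and example model is hypothetical.

```python
from collections import deque
from fractions import Fraction

def mat_vec(M, v):
    """Multiply matrix M (a list of rows) by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def apply_T(A, B, o, v):
    """T(o) v with T(o) = A * diag(B[o]) (assumed form of the transition operator)."""
    return mat_vec(A, [b * x for b, x in zip(B[o], v)])

def rank(rows):
    """Rank of a list of row vectors, by exact Gaussian elimination."""
    rows = [list(r) for r in rows]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def in_span(V, f):
    """Line 6: V is linearly independent, so f is in Span(V) iff the rank does not grow."""
    return rank(V + [f]) == len(V)

def equivalent_priors(A, B, p, q):
    A = [[Fraction(a) for a in row] for row in A]
    B = [[Fraction(b) for b in row] for row in B]
    n, k = len(A), len(B)
    delta = [Fraction(a) - Fraction(b) for a, b in zip(p, q)]
    V, queue = [], deque([delta])
    # Step 1: find a basis for the span of {T(x) delta}.
    while queue and len(V) < n:
        f = queue.popleft()
        if not in_span(V, f):
            V.append(f)
            for o in range(k):
                queue.append(apply_T(A, B, o, f))
    # Step 2: every basis vector must lie in the null-space of B.
    return all(c == 0 for v in V for c in mat_vec(B, v))

half = Fraction(1, 2)
# All three states share one output distribution: any two priors are equivalent.
A_same = [[half, 0, 0], [half, 1, half], [0, 0, half]]
B_same = [[half, half, half], [half, half, half]]
same_outputs = equivalent_priors(A_same, B_same, [1, 0, 0], [0, 0, 1])

# Here p - q lies in the null-space of B but leaves it after one step.
A_leak = [[1, 0, 0], [0, 0, 0], [0, 1, 1]]
B_leak = [[1, 1, 0], [0, 0, 1]]
leaky = equivalent_priors(A_leak, B_leak, [1, 0, 0], [0, 1, 0])
```

The second example shows why membership in the null-space of $B$ alone is not enough: the two priors agree on the first output but disagree on the second.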



Correctness: The algorithm of Figure 2.2 proceeds in two steps. In Step 1 it
finds a basis $V$ and, in Step 2, it checks the necessary and sufficient condition for
equivalence given in Theorem 2.2. So, it checks equivalence of priors correctly if $V$ is
indeed a basis for the span of $\mathcal{D} = \{\delta(x) : \delta(x) = T(x)\delta,\ x \in O^* \cup \{\epsilon\}\}$. In order to
analyze the algorithm we will use the terminology that the vector $T(o_i)v$ is a child of
the vector $v$. When the basis finding step of the algorithm terminates, $V$ contains
a linearly independent collection of vectors. If the step terminated because $|V| = n$,
we must have a basis for $\mathrm{Span}(\mathcal{D})$ since the vectors in $\mathcal{D}$ are $n$-dimensional. Suppose
now that the basis finding step terminated because Queue was empty. Each child of
each of the vectors in $V$ was added to Queue by line 9. So each of these children
is either in $V$ or was found to be a linear combination of a set of vectors in $V$.
Let $C$ denote the set of children of elements of $V$ that are not themselves in $V$.
Then we can write $c_i = \sum_{v_j \in V} d_{ij} v_j$ for every $c_i \in C$. Suppose now $v \in \mathcal{D}$ is not
in $V$ and is not a child of a vector in $V$. By construction of the algorithm we
can find some string $x$ and some $c_i$ which is a child of a vector in $V$ such that
$T(x)c_i = v$. We wish to show that every such $v$ is in the span of $V$. We will do
this by induction on the length of the string $x$. If $|x| = 1$ so that $x = o_k \in O$,
then for some $c_i$ we know that $v = T(o_k)c_i = T(o_k)\sum_{v_j \in V} d_{ij} v_j = \sum_{v_j \in V} d_{ij} T(o_k) v_j$.
So we see that $v$ is a linear combination of children of elements of $V$, which all
necessarily fall in the span of $V$. Hence $v$ falls in the span of $V$ if $v = T(x)c_i$ for
any $x$ of length one and any $c_i \in C$. Now assume that for every $x$ such that $|x| \le t$
we know that $v = T(x)c_i$ is in the span of $V$. So we write that $v = \sum_{v_j \in V} d_{vj} v_j$.
Then for every string $y = x o_k$ of length $t + 1$ we know that there is a $c_i$ such that
$u = T(y)c_i = T(o_k)T(x)c_i = T(o_k)v = T(o_k)\sum_{v_j \in V} d_{vj} v_j$. Taking the multiplication
by $T(o_k)$ into the sum we see that $u$ is a linear combination of vectors in $V$ and their
children, all of which fall in $\mathrm{Span}(V)$. So $u \in \mathrm{Span}(V)$ also. By induction on $t = |x|$,
all $v \in \mathcal{D}$ are in the span of $V$. Therefore, as claimed, $V$ is a basis for the span of
the vectors in $\mathcal{D}$. The second step of the algorithm then evaluates the necessary and
sufficient condition of Theorem 2.2 on the basis generated in the first step. Therefore,
our algorithm is correct. □

Running Time: We will now compute the worst case running time of the equivalent
priors algorithm assuming unit cost arithmetic operations. Once the basis $V$ is
generated in Step 1, the check performed in Step 2 takes $O(n^2 k)$ time since $|V| \le n$
and each multiplication by $B$ takes time $O(nk)$. In addition, it takes $O(n^2 k)$ time to
generate all the $T(o_i)$ matrices used in the algorithm from the given $A$ and $B$
matrices. To analyze Step 1, we observe that each multiplication of $f$ by $T(o_i)$ in line 9
takes time $O(n^2)$. In the worst case the basis $V$ generated in Step 1 will contain $n$
elements. For every $v \in V$ and every $o_i \in O$, line 9 adds all vectors $T(o_i)v$ to Queue.
So, in all, time $O(n^2 \cdot nk) = O(n^3 k)$ could be spent extending Queue. The final
contribution to the running time is from the check in line 6 of the algorithm to see
if $f$ should be added to the partially generated basis. We observe that $f \notin \mathrm{Span}(V)$
can be tested in time $O(n|V|^2 + |V|^3)$ by standard Gaussian elimination [press90].
In the worst case, the first $n - 1$ vectors that are tested in line 6 will be added
to the basis, and all the remaining $nk - (n - 1)$ vectors in Queue will have to be
tested to find the last basis vector. So, for large $k$ and $n$, these tests will take time
$O(n^3) \cdot O(nk) = O(n^4 k)$. This gives an $O(n^4 k)$ running time for the algorithm. We
can do better by being a little more clever about the test in line 6. An optimized
algorithm would maintain, in addition to the basis set $V$, a set $U$ of orthonormal
basis vectors produced by applying the Gram-Schmidt procedure to $V$. Every time a
vector $f$ is extracted from Queue, it is orthogonalized against the current set $U$. If
the residue of this procedure is the zero vector, $f$ is in $\mathrm{Span}(U) = \mathrm{Span}(V)$, and so $f$
is thrown away. 3 If the residue is non-zero, $f$ is added to $V$ and the residue is added
to $U$. The Gram-Schmidt procedure would take time $O(n|V|)$ since it just involves
projection of $f$ onto each of the vectors in $U$ and $|U| = |V|$. Repeating the earlier
analysis gives a worst case running time of $O(n^3 k)$ for this optimized algorithm.
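
The optimized span test can be sketched as below. As a deliberate simplification, this sketch keeps the vectors in $U$ mutually orthogonal but not normalized, so that exact rational arithmetic can be used and no square roots are needed; the span test and the $O(n|V|)$ cost per vector are unaffected. The names `residue` and `try_extend` are hypothetical.

```python
from fractions import Fraction

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def residue(U, f):
    """Remove from f its components along the mutually orthogonal vectors in U."""
    r = list(f)
    for u in U:
        coeff = dot(r, u) / dot(u, u)   # u is never zero: only non-zero residues enter U
        r = [a - coeff * b for a, b in zip(r, u)]
    return r

def try_extend(V, U, f):
    """Add f to V (and its residue to U) unless f already lies in Span(V)."""
    r = residue(U, f)
    if all(c == 0 for c in r):
        return False                    # f is in Span(U) = Span(V): throw it away
    V.append(list(f))
    U.append(r)
    return True

V, U = [], []
vectors = [[1, -1, 0], [2, -2, 0], [1, 0, -1], [0, 1, -1]]
added = [try_extend(V, U, [Fraction(c) for c in f]) for f in vectors]
```

In the example, the second vector is a multiple of the first and the fourth is the difference of the third and the first, so only two vectors enter the basis.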

The next section uses this result concerning equivalence of priors to develop an 
algorithm to test equivalence of Initialized Hidden Markov Models. 

2.3 Equivalence of Initialized HMMs 

In order to develop an algorithm to check equivalence of Initialized Hidden Markov
Models we will utilize a popular trick from the theory of Finite Automata. Given two
models we will build a new HMM whose properties will enable us to check equivalence
of the two given models easily. (See Figure 2.3) Suppose $\mathcal{M} = (S_{\mathcal{M}}, O, A_{\mathcal{M}}, B_{\mathcal{M}})$ and
$\mathcal{N} = (S_{\mathcal{N}}, O, A_{\mathcal{N}}, B_{\mathcal{N}})$ are two HMMs initialized by priors $p$ and $q$ respectively. Then
we construct a new HMM $\mathcal{Q} = (S_{\mathcal{Q}}, O_{\mathcal{Q}}, A_{\mathcal{Q}}, B_{\mathcal{Q}})$ where $S_{\mathcal{Q}} = S_{\mathcal{M}} \cup S_{\mathcal{N}}$ and $O_{\mathcal{Q}} = O$.
If $\mathcal{M}$ has $m$ states, $\mathcal{N}$ has $n$ states and $|O| = k$ we define:

    $A_{\mathcal{Q}} = \begin{bmatrix} A_{\mathcal{M}} & 0_{m \times n} \\ 0_{n \times m} & A_{\mathcal{N}} \end{bmatrix}$    (2.1)

    $B_{\mathcal{Q}} = \begin{bmatrix} B_{\mathcal{M}} & B_{\mathcal{N}} \end{bmatrix}$    (2.2)

(We are using the notation $0_{i \times j}$ for the $i$ by $j$ matrix whose entries are all zero.)
Essentially, $\mathcal{Q}$ consists of two disjoint HMMs, $\mathcal{M}$ and $\mathcal{N}$, which have been concatenated
together as in Figure 2.3. Let $p_{\mathcal{Q}} = [p, 0]$ be a prior on $\mathcal{Q}$ such that it equals the
prior $p$ on the states corresponding to $\mathcal{M}$ and is zero on the states corresponding to
$\mathcal{N}$. Also define $q_{\mathcal{Q}} = [0, q]$ similarly. Then, by construction, it must be true for
any $x \in O^* \cup \{\epsilon\}$ that $\Pr(x|\mathcal{M}, p) = \Pr(x|\mathcal{Q}, p_{\mathcal{Q}})$ and also $\Pr(x|\mathcal{N}, q) = \Pr(x|\mathcal{Q}, q_{\mathcal{Q}})$.
So $\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$ if and only if $p_{\mathcal{Q}}$ and $q_{\mathcal{Q}}$ are equivalent priors for our new HMM
$\mathcal{Q}$. Therefore, as a corollary of the results from the previous section, we can check
equivalence of two initialized Hidden Markov Models in $O((n + m)^3 k)$ time if the
models have $n$ and $m$ states respectively and share an output set of size $k$.

3 We are using the term "residue" to mean the piece of a vector that is left after removing all
components along vectors in a given set.
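
The construction of $\mathcal{Q}$ described in this section can be sketched in Python as follows. The sketch only builds the combined model and confirms, on short strings, that the lifted priors reproduce the string probabilities of the original models; the equivalence test itself would then be the priors algorithm of Section 2.2 applied to $\mathcal{Q}$. It assumes $T(o_i) = A\,\mathrm{diag}(b_i)$ with column-stochastic $A$, and all names and example models are hypothetical.

```python
from fractions import Fraction
from itertools import product

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def apply_T(A, B, o, v):
    """T(o) v with T(o) = A * diag(B[o]) (assumed form of the transition operator)."""
    return mat_vec(A, [b * x for b, x in zip(B[o], v)])

def string_probability(A, B, p, x):
    """Pr(x | model, p) for a non-empty string x."""
    v = list(p)
    for o in x[:-1]:
        v = apply_T(A, B, o, v)
    return mat_vec(B, v)[x[-1]]

def combine(A_M, B_M, A_N, B_N):
    """A_Q is block diagonal and B_Q places B_M and B_N side by side."""
    m, n = len(A_M), len(A_N)
    A_Q = [row + [0] * n for row in A_M] + [[0] * m + row for row in A_N]
    B_Q = [row_M + row_N for row_M, row_N in zip(B_M, B_N)]
    return A_Q, B_Q

def lift_priors(p, q):
    """p_Q = [p, 0] and q_Q = [0, q]."""
    return list(p) + [0] * len(q), [0] * len(p) + list(q)

h = Fraction(1, 2)
A_M, B_M = [[h, Fraction(1, 4)], [h, Fraction(3, 4)]], [[h, h], [h, h]]
A_N, B_N = [[Fraction(1)]], [[h], [h]]
p, q = [Fraction(1, 3), Fraction(2, 3)], [Fraction(1)]

A_Q, B_Q = combine(A_M, B_M, A_N, B_N)
p_Q, q_Q = lift_priors(p, q)
strings = [list(x) for t in (1, 2, 3) for x in product(range(2), repeat=t)]
agree_M = all(string_probability(A_M, B_M, p, x) == string_probability(A_Q, B_Q, p_Q, x)
              for x in strings)
agree_N = all(string_probability(A_N, B_N, q, x) == string_probability(A_Q, B_Q, q_Q, x)
              for x in strings)
```

Because the two blocks of $A_{\mathcal{Q}}$ never exchange probability mass, a prior supported on one block behaves exactly like the corresponding original model.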

In the next section we will investigate algorithms for deciding subset relations and
equivalence of Hidden Markov Models.



2.4 Equivalence of Hidden Markov Models 

In Chapter 1 we discussed the interpretation of HMMs as representations for classes
of stochastic processes, whose elements are derived by initializing prior distributions
on the models. Definition 2.4 defined an HMM $\mathcal{N}$ to be a subset of an HMM $\mathcal{M}$ ($\mathcal{N} \subseteq \mathcal{M}$)
when every process that can be represented by $\mathcal{N}$ can also be represented by $\mathcal{M}$.
Equivalence of Hidden Markov Models was defined by saying $\mathcal{M} \Leftrightarrow \mathcal{N}$ exactly when
$\mathcal{M} \subseteq \mathcal{N}$ and $\mathcal{N} \subseteq \mathcal{M}$. This definition partitions HMMs into disjoint equivalence
classes that are representations for the same sets of stochastic processes. (This does
not, of course, partition the stochastic processes representable by HMMs into disjoint
classes since a given process may be representable by non-equivalent HMMs.) Our
goal in the next chapter will be to find a way of generating a minimal, canonical
representative of each equivalence class in order to isolate the essential expressive
degrees of freedom in an HMM. Producing such canonical representations will also
reduce the computational overhead involved in the use of large models. As a prelude,
in this section, we will develop an algorithm that will check whether two models
$\mathcal{M}$ and $\mathcal{N}$ are in a subset relation to each other. A corollary will let us check
equivalence of Hidden Markov Models. We will build up to the algorithm and the
associated characterization of equivalent HMMs by proving a series of lemmas.

To test equivalence of two Initialized HMMs, $\mathcal{M}$ and $\mathcal{N}$, we first construct a larger
HMM $\mathcal{Q}$, which contains $\mathcal{M}$ and $\mathcal{N}$ as disjoint internal chains. If $p$ and $q$ are the
fixed priors on $\mathcal{M}$ and $\mathcal{N}$ respectively, checking equivalence of the priors $(p, 0)$ and
$(0, q)$ for the model $\mathcal{Q}$ should check that $\mathcal{M}$ and $\mathcal{N}$ are equivalent Initialized HMMs.

Figure 2-3: Checking Equivalence of Initialized HMMs

Let $\mathcal{M}_1 = (S_1, O, A_1, B_1)$ and $\mathcal{M}_2 = (S_2, O, A_2, B_2)$ be two Hidden Markov
Models. From the definitions we see that $\mathcal{M}_2 \subseteq \mathcal{M}_1$ exactly when for every prior
$p_2$ on $\mathcal{M}_2$ we can find a prior $p_1$ on $\mathcal{M}_1$ that makes $\mathcal{M}_1(p_1) \Leftrightarrow \mathcal{M}_2(p_2)$. Using the
definition of equivalent Initialized HMMs (Definition 2.2) we can write this as: for
every prior $p_2$ on $\mathcal{M}_2$ there exists a prior $p_1$ on $\mathcal{M}_1$ such that $\forall x \in O^* \cup \{\epsilon\}$ we can
write $B_1 T_1(x) p_1 = B_2 T_2(x) p_2$. This implies the following lemma which essentially
says that there is a stochastic matrix that transforms the priors on one machine into
equivalent priors on the other.

Lemma 2.1 Transformation of Priors
If $\mathcal{M}_1 = (S_1, O, A_1, B_1)$ and $\mathcal{M}_2 = (S_2, O, A_2, B_2)$ then $\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and only if
there exists a stochastic matrix 4 $C$ such that $\forall x \in O^* \cup \{\epsilon\}$, $B_1 T_1(x) C = B_2 T_2(x)$.

Proof: First, suppose $\mathcal{M}_2 \subseteq \mathcal{M}_1$. Let $e_2(i)$ be a prior on $\mathcal{M}_2$ with all its mass
on state $s_i$. Let $p_1(i)$ be the corresponding prior on $\mathcal{M}_1$ such that $\forall x \in O^* \cup \{\epsilon\}$,
$B_1 T_1(x) p_1(i) = B_2 T_2(x) e_2(i)$. Such a $p_1(i)$ exists by assumption of $\mathcal{M}_2 \subseteq \mathcal{M}_1$.
Let $C$ be a matrix whose $i$-th column is $p_1(i)$. In other words, $C = [p_1(1) | p_1(2) | \cdots | p_1(n_2)]$
where $n_2$ is the number of states in $\mathcal{M}_2$. It is clear that any prior on $\mathcal{M}_2$ can be
written as $p_2 = \sum_{i=1}^{n_2} p_{2i}\, e_2(i)$, where $p_{2i}$ is the $i$-th component of $p_2$, and that we will have:

    $\forall x \in O^* \cup \{\epsilon\}$,  $B_2 T_2(x) p_2 = \sum_{i=1}^{n_2} p_{2i}\, B_2 T_2(x) e_2(i) = \sum_{i=1}^{n_2} p_{2i}\, B_1 T_1(x) p_1(i) = B_1 T_1(x) C p_2$

Since this is true for any $p_2$ we can conclude that if $\mathcal{M}_2 \subseteq \mathcal{M}_1$ then $\forall x \in O^* \cup \{\epsilon\}$ we
can write $B_1 T_1(x) C = B_2 T_2(x)$. Furthermore, by construction, $C$ is stochastic.
To prove the lemma in the other direction, suppose that the matrix $C$ exists and,
for any prior $p_2$ on $\mathcal{M}_2$, let $p_1 = C p_2$ be the corresponding prior on $\mathcal{M}_1$. Then,
by the definition of equivalence, $\mathcal{M}_1(p_1) \Leftrightarrow \mathcal{M}_2(p_2)$ since $\forall x \in O^* \cup \{\epsilon\}$ we can
write $B_1 T_1(x) (C p_2) = B_2 T_2(x) p_2$. Since this is true for any $p_2$ we conclude that
$\mathcal{M}_2 \subseteq \mathcal{M}_1$. □

4 By "stochastic matrix" we mean a matrix whose entries are all non-negative and whose columns
sum to one.

Lemma 2.1 is not a sufficiently powerful characterization of equivalence of HMMs 
to enable us to construct an algorithm to check equivalence. Essentially, we want to 
find a necessary and sufficient condition that does not require us to examine every 
finite prefix of outputs of a process in order to check the equivalence of models. Our 
previous results achieved this goal by examining the properties of various vector spaces 
and checking an equivalence condition on their bases. The next lemma we prove will 
tell us how to find such a vector space that allows us to relax the equivalence condition 
in Lemma 2.1. In order to do this we need to introduce a little additional notation. 

Definition 2.5 Suffix Matrix 

Let $\mathcal{M} = (S, O, A, B)$ be an HMM. Define a suffix matrix $\Sigma(x) = B T(x)$ for every
$x \in O^* \cup \{\epsilon\}$. So $\Sigma(x)_{ij} = \Pr(\mathcal{M} \text{ emits } x o_i \mid \mathcal{M} \text{ started in state } s_j)$. The name
suffix matrix originates from the observation that if $z = xy$ is a string with prefix
$x$ and suffix $y$, then $\Sigma(z) = \Sigma(y) T(x)$. Suppose $y$ is any string in $O^*$. Then we can
always write $y = x o_i$ where $o_i \in O$ and $x \in O^* \cup \{\epsilon\}$. For any $y = x o_i \in O^*$ we
will use the notation $\sigma(y)$ to mean the $i$-th row of $\Sigma(x)$. The $j$-th component of $\sigma(y)$
satisfies the equation $\sigma(y)_j = \Pr(\mathcal{M} \text{ emits } y \mid \mathcal{M} \text{ started in state } s_j)$.
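
To make the row vectors $\sigma(y)$ concrete, the following Python sketch computes them from the relation $\Sigma(z) = \Sigma(y) T(x)$, which for a single leading symbol gives $\sigma(o\,z) = \sigma(z) T(o)$. It assumes $T(o_i) = A\,\mathrm{diag}(b_i)$ with $b_i$ the $i$-th row of $B$ and column-stochastic $A$; the function names and the small example model are hypothetical.

```python
def T_matrix(A, B, o):
    """T(o) = A * diag(B[o]) as an explicit matrix (assumed form)."""
    return [[a * b for a, b in zip(row, B[o])] for row in A]

def row_times(r, M):
    """Row vector r times matrix M."""
    return [sum(r[i] * M[i][j] for i in range(len(r))) for j in range(len(M[0]))]

def sigma(A, B, y):
    """sigma(y): entry j is Pr(model emits y | model started in state s_j).

    For y = x o_i this is row i of Sigma(x) = B T(x); it is built from the last
    symbol backwards using sigma(o z) = sigma(z) T(o).
    """
    r = list(B[y[-1]])
    for o in reversed(y[:-1]):
        r = row_times(r, T_matrix(A, B, o))
    return r

# Hypothetical 3-state model: states 0 and 1 emit symbol 0, state 2 emits symbol 1.
# State 0 stays put, state 1 moves to state 2, and state 2 stays put.
A = [[1, 0, 0],
     [0, 0, 0],
     [0, 1, 1]]
B = [[1, 1, 0],
     [0, 0, 1]]

s_0 = sigma(A, B, [0])       # probability of emitting "0" from each start state
s_01 = sigma(A, B, [0, 1])   # probability of emitting "0" then "1"
s_00 = sigma(A, B, [0, 0])   # probability of emitting "0" then "0"
```

In this model only a start in state 1 can produce "0" followed by "1", and only a start in state 0 can produce "0" followed by "0", which is what the computed rows record.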

Lemma 2.1 implies that if $\mathcal{M}_2 \subseteq \mathcal{M}_1$, then linear dependence amongst the rows of
$\Sigma_1(x)$ implies dependence amongst the rows of $\Sigma_2(x)$. This provides a clue that the
key to equivalence of HMMs lies in comparing the spaces spanned by the rows of the
suffix matrix. Investigating this idea leads to the following lemma.

Lemma 2.2 Equivalence Condition

Let $\mathcal{M}_1 = (S_1, O, A_1, B_1)$ and $\mathcal{M}_2 = (S_2, O, A_2, B_2)$ be two Hidden Markov Models.
Let $\mathcal{U}_1 = \{\sigma_1(y) : y \in O^*\}$ be the set of all rows of the suffix matrices of $\mathcal{M}_1$. Let
$V = \{\sigma_1(x_1), \sigma_1(x_2), \ldots, \sigma_1(x_{|V|})\}$ be a basis for $\mathrm{Span}(\mathcal{U}_1)$. Then $\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and only
if there exists a stochastic matrix $C$ that satisfies the following conditions:

    $\forall o_k \in O$,  $\sigma_1(o_k) C = \sigma_2(o_k)$    (2.3)

    $\forall x_i$ such that $\sigma_1(x_i) \in V$,  $\sigma_1(x_i) C = \sigma_2(x_i)$    (2.4)

    $\forall o_j \in O$ and $\forall \sigma_1(x) \in V$,  $\sigma_1(x) [T_1(o_j) C - C T_2(o_j)] = 0$    (2.5)

Prior to proving this lemma it will help to gain some intuition for what it means.
Remember that the matrix $C$ in Lemma 2.1 transforms priors on $\mathcal{M}_2$ into priors
on $\mathcal{M}_1$, and that the $j$-th component of $\sigma_1(x)$ is the probability of emitting string
$x$, having started in state $s_j$. Using these two facts we can see that Equation 2.4
says that for any choice of priors on $\mathcal{M}_2$ there is a prior on $\mathcal{M}_1$ such that the
probability of emitting a string $y$ is the same for both models if $\sigma_1(y)$ is in the basis
for $\mathrm{Span}(\mathcal{U}_1)$. Equation 2.3 says the same thing for all strings of length one. We will
eventually use these two facts in the base case of an induction to prove the lemma.
We will see that Equation 2.5 is a way of saying that if $\sigma_1(x) C = \sigma_2(x)$ for some $x$
then this condition is also satisfied for any string $y$ that is one symbol longer than $x$.
We will use this as the induction step in the proof below.

Proof: First we will prove that if $\mathcal{M}_2 \subseteq \mathcal{M}_1$ then Equations 2.3 to 2.5 will be true.
So suppose that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. Then by Lemma 2.1, there is a stochastic matrix $C$ such
that for every $x \in O^* \cup \{\epsilon\}$, every row $\sigma_1(x o_i)$ of $\Sigma_1(x)$ satisfies $\sigma_1(x o_i) C = \sigma_2(x o_i)$
where $\sigma_2(x o_i)$ is the corresponding row of $\Sigma_2(x)$. This at once makes Equations 2.3
and 2.4 true. Then we turn to Equation 2.5. Let $x$ be any string in $O^*$ and
let $y = o_i x$ be an $|x| + 1$ long string with $x$ as a suffix. Then by assumption of
$\mathcal{M}_2 \subseteq \mathcal{M}_1$, and using the definition of the suffix matrix we can make the following
series of statements:

    $\sigma_1(y) C = \sigma_2(y)$

    $\sigma_1(x) T_1(o_i) C = \sigma_2(x) T_2(o_i)$

    $\sigma_1(x) T_1(o_i) C = \sigma_1(x) C T_2(o_i)$

    $\Longrightarrow \sigma_1(x) [T_1(o_i) C - C T_2(o_i)] = 0$    (2.6)

The second equation is derived from the first from the definition of $\Sigma(x)$ and $\sigma(x)$.
The third equation simply replaces $\sigma_2(x)$ by $\sigma_1(x) C$ by assumption of $\mathcal{M}_2 \subseteq \mathcal{M}_1$
and Lemma 2.1. Since Equation 2.6 holds for every $o_i \in O$ and for every $x$ such that
$\sigma_1(x) \in V$, we have proven the necessity of the lemma. Next we will prove the lemma
in the other direction. Suppose that a stochastic matrix $C$ satisfying the conditions
of the lemma exists. Then, by Equation 2.3, $\sigma_1(x) C = \sigma_2(x)$ for every string $x$ of
length 1. Then assume that for all $x$ of length less than or equal to $l$ we can write
$\sigma_1(x) C = \sigma_2(x)$. For any such $x$ ($|x| \le l$) we can write $\sigma_1(x) = \sum_{i=1}^{|V|} d_i \sigma_1(x_i)$ for
some choice of $d_i$, where the $\sigma_1(x_i)$ are elements of the basis $V$. So, by the induction
assumption, and Equation 2.4, we can write:
$\sigma_2(x) = \sigma_1(x) C = \sum_{i=1}^{|V|} d_i \sigma_1(x_i) C = \sum_{i=1}^{|V|} d_i \sigma_2(x_i)$.
But this means that for every output $o_j$, we can use Equation 2.5 to write:

    $\sigma_1(o_j x) C = \sigma_1(x) T_1(o_j) C = \sum_{i=1}^{|V|} d_i \sigma_1(x_i) T_1(o_j) C$    (2.7)

    $= \sum_{i=1}^{|V|} d_i \sigma_1(x_i) C T_2(o_j)$    (2.8)

    $= \sum_{i=1}^{|V|} d_i \sigma_2(x_i) T_2(o_j)$    (2.9)

    $= \sigma_2(x) T_2(o_j) = \sigma_2(o_j x)$    (2.10)

We go from Equation 2.7 to Equation 2.8 by applying condition 2.5 of the lemma. The
next two lines simply substitute the expression for $\sigma_2(x)$ obtained from the induction
assumption. The conclusion is that if $\sigma_1(x) C = \sigma_2(x)$ for all strings $x$ of length less
than or equal to $l$, then the same is true for strings of length $l + 1$. This completes
the induction and proves that for every $x$, we can write $\sigma_1(x) C = \sigma_2(x)$, implying
that $\forall x \in O^* \cup \{\epsilon\}$, $\Sigma_1(x) C = \Sigma_2(x)$. By Lemma 2.1, this shows that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. □

Lemma 2.2 could be used to build a polynomial time algorithm for testing equiv- 
alence of HMMs. Such an algorithm would begin by generating the basis V in the 
lemma. We would use the efficient basis-generation technique used in Step 1 of our 
algorithm for checking equivalence of prior distributions. Then we would use linear 
programming techniques to find a matrix $C$ satisfying the conditions of the lemma. 5
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ only if such a matrix is found. Since linear programming problems can
be solved quickly, such an algorithm would run in polynomial time ([karmarkar84]).
However, it is possible to do even better. Some recent results in the theory of proba- 
bilistic automata ([tzeng]), that are achieved using methods similar to ours, suggest 
that the following lemma should be true. 

Lemma 2.3 All C matrices are equivalent 

Let $C_1$ and $C_2$ be any two stochastic matrices satisfying $\sigma_1(x_i) C_1 = \sigma_1(x_i) C_2$ for every
$\sigma_1(x_i) \in V$, where $V$ is the basis in Lemma 2.2. Then, for any string $x$ we can write
$\sigma_1(x) C_1 = \sigma_1(x) C_2$.

Proof: Suppose $x$ is any string. Then for some choice of $d_i$ we know that
$\sigma_1(x) = \sum_{i=1}^{|V|} d_i \sigma_1(x_i)$ where the $\sigma_1(x_i)$ are the elements of the basis in Lemma 2.2. Then it
is clear that $\sigma_1(x) C_1 = \sum_{i=1}^{|V|} d_i \sigma_1(x_i) C_1 = \sum_{i=1}^{|V|} d_i \sigma_1(x_i) C_2 = \sigma_1(x) C_2$. □

Collecting all our lemmas together, we can finally state our theorem characterizing
equivalent Hidden Markov Models.

5 We need to use linear programming rather than straightforward linear algebra because the
stochasticity constraints on $C$ involve inequalities.

Theorem 2.3 Equivalence of HMMs

Let $\mathcal{M}_1 = (S_1, O, A_1, B_1)$ and $\mathcal{M}_2 = (S_2, O, A_2, B_2)$ be two Hidden Markov Models.
Let $\mathcal{U}_1 = \{\sigma_1(y) : y \in O^*\}$ be the set of all rows of the suffix matrices of $\mathcal{M}_1$. Let
$V = \{\sigma_1(x_1), \sigma_1(x_2), \ldots, \sigma_1(x_{|V|})\}$ be a basis for $\mathrm{Span}(\mathcal{U}_1)$. Then $\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and
only if the following two conditions hold. (a) There exists a stochastic matrix $C$ such
that for every $x_i$ satisfying $\sigma_1(x_i) \in V$ we can write $\sigma_1(x_i) C = \sigma_2(x_i)$. (b) For any
stochastic $C$ satisfying condition (a), the following must be true:

    $\forall o_k \in O$,  $\sigma_1(o_k) C = \sigma_2(o_k)$    (2.11)

    $\forall o_j \in O$ and $\forall \sigma_1(x) \in V$,  $\sigma_1(x) [T_1(o_j) C - C T_2(o_j)] = 0$    (2.12)

$\mathcal{M}_1 \Leftrightarrow \mathcal{M}_2$ if and only if $\mathcal{M}_2 \subseteq \mathcal{M}_1$ and $\mathcal{M}_1 \subseteq \mathcal{M}_2$.

Proof: The proof follows easily from Lemmas 2.2 and 2.3. Suppose the conditions
(a) and (b) of our theorem hold, and pick any $C$ satisfying them. This $C$ also
satisfies the conditions of Lemma 2.2 so that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. So conditions (a) and (b) are
sufficient to guarantee that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. Next we show that they are also necessary
conditions. So suppose that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. First notice that Equation 2.11 says that
$\sigma_1(x) C = \sigma_2(x)$ for every string $x$ of length 1. Also remember from the proof of
Lemma 2.2 that Equation 2.12 essentially says that if $\sigma_1(x) \in V$, then any string
$y = o_i x$ satisfies the condition $\sigma_1(y) C = \sigma_2(y)$. Lemma 2.3 tells us that if $C_1$ and
$C_2$ both satisfy condition (a), then $\sigma_1(x) C_1 = \sigma_1(x) C_2$ for any string $x$. So, if any
$C$ satisfies condition (a) and the equations of condition (b), then every $C$ satisfying
(a) also satisfies condition (b). By Lemma 2.2 there is a stochastic matrix $C$
satisfying condition (a) and Equations 2.11 and 2.12. Therefore, as discussed above, every
$C$ fulfilling condition (a) also satisfies the equations of (b). This proves that (a)
and (b) are necessary conditions for $\mathcal{M}_2 \subseteq \mathcal{M}_1$ to be true. We have already shown
that they are sufficient conditions and so our proof of the theorem is complete. □

Algorithm: We can use Theorem 2.3 to develop a polynomial time algorithm to test
equivalence of HMMs. We do this by first checking if $\mathcal{M}_2 \subseteq \mathcal{M}_1$ and then checking
$\mathcal{M}_1 \subseteq \mathcal{M}_2$. So suppose we are trying to check that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. The subset-checking
algorithm starts by generating the basis of Theorem 2.3 using the method of Step 1
in the algorithm for determining equivalence of priors. It then tries to find a matrix
$C$ satisfying the equivalence condition (a) for this basis. If no such matrix can be
found, then $\mathcal{M}_2 \not\subseteq \mathcal{M}_1$. If a $C$ satisfying condition (a) is found, we check that it
satisfies the equations of condition (b). If it passes this test, Lemma 2.2 tells us that
$\mathcal{M}_2 \subseteq \mathcal{M}_1$. If it fails, Theorem 2.3 tells us that $\mathcal{M}_2 \not\subseteq \mathcal{M}_1$. We check $\mathcal{M}_1 \subseteq \mathcal{M}_2$
similarly and answer the question of equivalence
appropriately. Correctness of this algorithm is immediate from the correctness of our
earlier algorithm to determine equivalence of priors, and from Theorem 2.3.
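
The final verification step of this algorithm, checking Equations 2.11 and 2.12 for a candidate matrix, can be sketched as follows. The sketch does not find $C$: it assumes a candidate has already been produced (for instance by a linear programming solver, which is not shown) and that the strings $x_i$ indexing the basis are supplied by the caller. It also assumes $T(o_i) = A\,\mathrm{diag}(b_i)$ with column-stochastic $A$ and exact rational inputs. All names and example models are hypothetical.

```python
from fractions import Fraction

def T_matrix(A, B, o):
    """T(o) = A * diag(B[o]) as an explicit matrix (assumed form)."""
    return [[a * b for a, b in zip(row, B[o])] for row in A]

def row_times(r, M):
    """Row vector r times matrix M."""
    return [sum(r[i] * M[i][j] for i in range(len(r))) for j in range(len(M[0]))]

def mat_mul(X, Y):
    return [row_times(row, Y) for row in X]

def sigma(A, B, y):
    """sigma(y), built from the last symbol backwards via sigma(o z) = sigma(z) T(o)."""
    r = list(B[y[-1]])
    for o in reversed(y[:-1]):
        r = row_times(r, T_matrix(A, B, o))
    return r

def satisfies_condition_b(M1, M2, C, basis_strings):
    """Check Equations 2.11 and 2.12 for a candidate C of shape n1 x n2."""
    (A1, B1), (A2, B2) = M1, M2
    k = len(B1)
    # Equation 2.11: sigma_1(o) C = sigma_2(o) for every single-symbol string.
    for o in range(k):
        if row_times(sigma(A1, B1, [o]), C) != sigma(A2, B2, [o]):
            return False
    # Equation 2.12: sigma_1(x) [T_1(o) C - C T_2(o)] = 0 for every basis string x.
    for o in range(k):
        left = mat_mul(T_matrix(A1, B1, o), C)
        right = mat_mul(C, T_matrix(A2, B2, o))
        diff = [[l - r for l, r in zip(lrow, rrow)] for lrow, rrow in zip(left, right)]
        for x in basis_strings:
            if any(c != 0 for c in row_times(sigma(A1, B1, x), diff)):
                return False
    return True

h = Fraction(1, 2)
M1 = ([[Fraction(1)]], [[h], [h]])                               # one state, fair coin
M2 = ([[h, Fraction(1, 4)], [h, Fraction(3, 4)]], [[h, h], [h, h]])  # two fair-coin states
M3 = ([[h, Fraction(1, 4)], [h, Fraction(3, 4)]],
      [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]])  # states reveal themselves
C = [[Fraction(1), Fraction(1)]]        # maps every prior on two states to the single state

fair_ok = satisfies_condition_b(M1, M2, C, [[0]])
revealing_ok = satisfies_condition_b(M1, M3, C, [[0]])
```

With floating-point parameters the exact comparisons above would need to be replaced by tolerance checks.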

We will now compute the running time of the HMM equivalence algorithm,
assuming unit cost arithmetic. First of all, it takes $O(k(n_1^2 + n_2^2))$ time to generate all
the $T_1(o_i)$ and $T_2(o_i)$ matrices from the parameters of the HMMs. From our earlier
analysis, the basis-finding algorithm takes worst-case time $O(n_1^3 k)$ when
appropriately optimized. We also need to compute $\sigma_2(x_i)$ corresponding to the $\sigma_1(x_i) \in V$.
This can be done at the same time that the basis is generated, simply adding a factor
of 2 to the cost. Once the basis is generated, finding a matrix $C$ satisfying
condition (a) involves solving a system of $n_2 |V|$ equations in $n_1 n_2$ variables, subject to
$n_2 + n_1 n_2$ stochasticity constraints. Since the constraints involve only linear
inequalities (the columns of $C$ sum to one and $\forall i, j$, $C_{ij} \ge 0$) we can solve for $C$ using linear
programming ([chvatal80]). Karmarkar ([karmarkar84]) gives a worst-case $O(L n^{3.5})$
time algorithm for linear programming where $n$ is the number of variables and $L$
is the size of the linear program in bits. (This is also competitive in practice with the
simplex algorithm.) It is a somewhat sticky business to translate the bit complexity
in terms of $L$ into a complexity in terms of the number of variables and equations in
the linear program. In rough terms, if we are dealing with a fixed number of bits per
number, we can say that $L$ is of the order of $O(mn)$, where $mn$ is roughly the size of
the linear programming tableau (with $m$ the number of constraints). Using this, we conclude
that we can find $C$, if a solution exists, in worst-case time
$O[(n_2 |V| + n_2 + n_1 n_2)(n_1 n_2)^{4.5}] = O[(n_1 n_2)^{5.5}]$ where
we have used the fact that $|V| \le n_1$. Once we have generated a matrix $C$, checking
that it satisfies Equation 2.11 takes time $O(k n_1 n_2)$ and checking Equation 2.12 takes
time $O[n_1 k (n_1^2 + n_2^2 + 2 n_1 n_2)]$. (Once again, we have used the fact that $|V| \le n_1$.)
Gathering all these terms together, and picking the dominant terms as $n_1$, $n_2$ and
$k$ grow large, we find that our algorithm for checking $\mathcal{M}_2 \subseteq \mathcal{M}_1$ runs in worst-case
time $O[n_1 k (n_1^2 + n_2^2 + 2 n_1 n_2) + (n_1 n_2)^{5.5}]$. The complexity of checking $\mathcal{M}_1 \subseteq \mathcal{M}_2$
is obtained by exchanging $n_1$ and $n_2$ everywhere in this expression. The algorithm
presented here can be optimized in various ways to do somewhat better, but these
optimizations are less interesting and more complicated to explain.

Chapter 3



Reduction to Canonical Forms 



In the previous chapter we defined equivalence of stochastic processes and proved how 
and why prior distributions on a model may be equivalent. We used these results to 
characterize equivalent Initialized Hidden Markov Models. Finally, we made various 
appeals to linear algebraic arguments to develop necessary and sufficient conditions 
for the equivalence of HMMs. However, our results concerning equivalent HMMs 
did not give a clear intuitive characterization of the intrinsic expressiveness of Hidden 
Markov Models. In an effort to achieve such a characterization, this chapter will define 
the canonical dimension of a model. The definition is related to our formulation of 
the theorems describing equivalent HMMs, and will lead quickly to an algorithm for 
finding canonical representations of models. All the theorems in this section will 
be proven in the context of Generalized Markov Models (GMMs) which relax the 
positivity constraints on the parameters of HMMs. We will see that all processes
that can be modelled exactly by Hidden Markov Models can also be modelled by 
Generalized Markov Models. Some kinds of GMMs, with appropriate restrictions 
placed on the allowable prior distributions, are equivalent to HMMs. In Section 3.2.1 
we will see how the results achieved in this chapter should be modified to apply 
to HMMs. We begin by defining Generalized Markov Models and discussing their 
properties. 

3.1 Generalized Markov Models

In this section we will define a new class of models of stochastic processes. Since this 
new class contains the processes modelled by traditional Hidden Markov Models, we 
will christen it the class of Generalized Markov Models. Essentially, the generalization 
involves relaxing the positivity constraint imposed by the probabilistic interpretation 
of the parameters describing the underlying Markov Chain of an HMM. First we will 
discuss why such a generalization may be a good idea, and then we will proceed to 
define GMMs and describe their properties. 

3.1.1 Why Should We Invent GMMs? 

Empirical Reasons: Our first motivation for defining GMMs is empirical. L. Niles,
in discussing the connections between stochastic classifiers and neural network schemes,
describes experiments with an HMM-net, a network implementation of an HMM [niles90].
He reports that corrective training methods lead to HMM-net parameters that violate
probability constraints, but are more successful in classification tasks.
Niles points out that relaxing the stochasticity constraint on HMM parameters while
preserving the formal structure$^1$ results in a perfectly valid classifier and
decision-boundary model. Of course, the Bayesian formulation of classification is lost.
However, Bayesian methods are only optimal if the true distributions are known, and this
is very far from the case in most applications of HMMs. In light of these facts, Niles
suggests that HMMs with "negative parameters" may be interesting because, in the
HMM-net formulation of Hidden Markov Models, they have a natural interpretation
as inhibitory connections. If we wish to follow this lead and investigate the properties
of various HMM-like models we should be able to analytically compare the properties
of the different schemes in order to be able to choose between them in a principled
manner. This thesis initially arose from an attempt to understand the properties of
HMMs sufficiently well to facilitate comparison with other classification schemes. The
Generalized Markov Models we will define in this chapter are a natural generalization
of HMMs which follows the empirical lead in [niles90] suggesting that "negative
parameters" may be a good idea. We are able to describe, in detail, the connections
between GMMs and HMMs.

1 By formal structure we mean, for example, the formal manipulations by which posterior
probabilities are extracted from the model. Of course, once the model parameters cannot be
interpreted as probabilities, we will be computing some non-probabilistic score.

Theoretical Reasons: We are also motivated to define Generalized Markov Models
from a theoretical perspective. First of all, we will take the view that an HMM is
simply an iterative, finite-state scheme used to represent the statistics of stochastic
processes. The interpretation of the model parameters as probabilities is peripheral
to the actual goal of realizing parsimonious and easily manipulated representations
of wide classes of stochastic processes. Therefore, there is no intrinsic reason why
the parameters of the model should be probabilities, unless we derive a clear benefit
from the constraints imposed by such an interpretation. If we discover that allowing
negative parameters in our model permits us to build better models, we should not
allow the probabilistic viewpoint to stop us. Secondly, in vague terms, all the results
from the previous chapter dealt with general linear combinations of elements of vector
spaces as opposed to convex combinations of vectors on simplices. (Probabilistic
parameter spaces normally lead to the latter situation.) It seems natural, therefore,
to ask whether it is really necessary for the parameters of an HMM-like model to be
positive in order to successfully model stochastic processes. For example, we may be
able to define a prior with "negative" parameters, without changing the probability
distributions over outputs that we care about. Suppose $p$ is a prior on a model $\mathcal{M}$,
and $X$ is an invariant subspace of the null-space of the output matrix. Then we
can remove the components of $p$ that lie in $X$ and the resulting vector $p'$ will
induce the same stochastic process on $\mathcal{M}$. (See the theorems in Section 2.2.) Notice
that $p'$ may have negative components, although it must still sum to one since the
vectors in $X$ necessarily sum to zero. Given this fact, define a valid prior to be any (not
necessarily stochastic) vector that induces a valid stochastic process when it initializes
a model. Clearly, from the above discussion, the set of valid priors extends beyond
the probability simplex. Extending the argument, we could permit the columns of the
transition matrix $A$ of a model to also be pseudo-stochastic.$^2$ A generalized model,
defined by relaxing constraints in this fashion, has the potential to model a wider
class of processes with the same number of states. This is particularly important
in pattern recognition applications because it is usually far from clear that the true
model of the system is a probabilistic function of a Markov Chain. Typically, the
best we can hope for is to approximate the statistics of a process as closely as possible
with our model. Therefore, a more expressive formalism could intrinsically provide a
better model.

Reasons of Parsimony: The final reason to consider Generalized Markov Models
is basically an argument that a smaller model is usually better. As discussed in the
previous paragraph, we would like to have more expressive formalisms for modelling
stochastic processes since we are typically dealing with problems of approximating
a system. However, if the formalism involves too many degrees of freedom, it will
suffer from the curse of dimensionality: it will become very difficult to estimate
the values of the model parameters from the sparse data that is typically available.
So we basically want to "say more with fewer parameters". We can also make the
computational argument that, in general, the more parameters we have to manipulate,
the slower all our algorithms will be. At the same time, the formal methods of
manipulating HMMs are so easy, intuitive and efficient that we would love to be
able to keep them. The Generalized Markov Models defined in this thesis achieve
both these goals by preserving the formal structure of HMMs, but liberating them
from constraints that limit the class of processes a given number of parameters could
model. Essentially, we attempt to get more mileage from each parameter of a model
by allowing it to range over a greater domain in a natural way. We will see, for
example, that the smallest HMM equivalent to a given model may have more states
than its smallest representation in the GMM formalism. This is our principal reason
for defining GMMs.

2 We do not relax the stochastic constraints on the output matrix because this makes analysis
considerably harder.

We can see from these arguments that it may be worthwhile to consider generalizations
of HMMs as techniques for modelling stochastic processes, especially for
pattern recognition applications. In particular, we have seen that it may be a good 
idea to relax the positivity constraint on the parameters of Hidden Markov Models. 
We will now define Generalized Markov Models and discuss their properties. 

3.1.2 Definition of GMMs 

Our first task is to define what we mean by "relaxing the positivity constraint" on
probabilities. To this end we make the following definition of a pseudo-stochastic 
vector: 

Definition 3.1 Pseudo-probability and Pseudo-stochasticity 

Define an $n$-dimensional vector $v$ to be pseudo-stochastic if each of its components
is real and $\sum_{i=1}^{n} v_i = 1$. Each entry of such a vector is called a pseudo-probability.
Pseudo-probabilities of alternative independent events add just like true probabilities.
Also define a pseudo-stochastic matrix to be one whose columns are pseudo-stochastic
vectors. A pseudo-Markov Chain is a Markov Chain whose transition matrix and
prior distribution are both pseudo-stochastic. In the rest of this chapter we will
frequently use the term "probability" even when we mean pseudo-probability. The
usage will be obvious from the context.
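
As a small illustration of Definition 3.1, the following Python sketch checks pseudo-stochasticity numerically. The function names and the tolerance `tol` are illustrative assumptions and do not come from the thesis; the only content taken from the definition is that entries are real and sum to one, with no sign constraint.

```python
import numpy as np

def is_pseudo_stochastic(v, tol=1e-9):
    """True if v is a real vector whose entries sum to one (entries may be negative)."""
    v = np.asarray(v, dtype=float)
    return v.ndim == 1 and bool(np.isclose(v.sum(), 1.0, atol=tol))

def is_stochastic(v, tol=1e-9):
    """True if v is pseudo-stochastic and additionally has no negative entry."""
    v = np.asarray(v, dtype=float)
    return is_pseudo_stochastic(v, tol) and bool(np.all(v >= -tol))

def is_pseudo_stochastic_matrix(m, tol=1e-9):
    """True if every column of m is a pseudo-stochastic vector."""
    m = np.asarray(m, dtype=float)
    return m.ndim == 2 and bool(np.allclose(m.sum(axis=0), 1.0, atol=tol))
```

For instance, the vector (2, -1) is pseudo-stochastic but not stochastic, which is exactly the kind of "distribution" over states that a GMM admits.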

We will define GMMs by essentially replacing the probabilities describing the underlying
Markov Chain of an HMM with pseudo-probabilities. We will need to impose
some additional constraints on allowable priors to ensure that the model describes
valid stochastic processes.
Definition 3.2 Generalized Markov Models (GMMs)

A Generalized Markov Model is defined as a quadruple $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ where $\mathcal{S}$ is
a set of $n$ states, $\mathcal{O}$ is a discrete set of $k$ outputs and $B$ is a stochastic output matrix
as in the definition of HMMs. Define an $n$-dimensional pseudo-probability vector $v$
to be possible for $\mathcal{M}$ if the product $Bv$ is a stochastic vector. (In other words, $v$ is
possible if $B$ maps $v$ to a probability distribution over the outputs.) Also define an
$n$-dimensional vector $u$ to be valid for $\mathcal{M}$ if $u$ induces a valid stochastic process when
$\mathcal{M}$ is initialized by $u$ and evolved according to the formal rules specified in Chapter 1.

We demand that all $n$-dimensional stochastic vectors be valid for $\mathcal{M}$. The transition
matrix $A$ of a GMM must then be a pseudo-stochastic matrix whose columns are valid
vectors for $\mathcal{M}$.

We can see that GMMs are very similar to HMMs except that the underlying chain 
is a pseudo-Markov Chain. By this definition, every HMM is structurally a GMM,
but in the GMM formulation we would be permitted to initialize the model with 
valid priors that are not stochastic. Definition 3.2 is not very constructive in that it 
does not characterize what the valid priors on a model look like. The results we will 
arrive at in this chapter, including the derivation of canonical forms for GMMs, do 
not require such a characterization. We will return to this sticky issue briefly at the 
end of the section. 

GMM Evolution: We will evolve a GMM forward in time by treating pseudo-probabilities
formally as if they are true probabilities. In particular, projection and
transition operators are formally defined exactly as in Table 1.1. The only difference
lies in the interpretation of the various quantities. The $(i,j)$th component of the
transition operator $T(o_k)$ is now understood to be the pseudo-probability that the
underlying chain will transition from state $s_j$ to $s_i$, weighted by the true probability
of emitting $o_k$ in state $s_j$. All probabilities related to the states in an HMM are
replaced by pseudo-probabilities in a GMM, but we still retain the true probability
interpretation of distributions over outputs. The suffix matrix of Definition 2.5 will
be important to us in our discussion of reduction of HMMs. For any string $x$, the
suffix matrix is defined as $S(x) = BT(x)$ where $B$ is the GMM output matrix and
$T(x)$ is the GMM transition operator for string $x$. In the context of GMMs, $S(x)_{ij}$
is the probability that the model emits the string $x o_i$, given a pseudo-probability of 1
that the model started in state $s_j$. The meaning of the vectors $\sigma(x)$ in Definition 2.5
is also appropriately modified. Henceforth, when we speak of transition operators,
suffix matrices or any other quantity originally defined for HMMs in the context of
GMMs, we will be referring to these objects interpreted as described above.
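
The evolution rules above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the thesis's own code: it assumes that $B$ is stored as a $k \times n$ array whose column $j$ is the output distribution of state $s_j$, that $A$ is column-(pseudo-)stochastic with `A[i, j]` the weight of moving from $s_j$ to $s_i$, and that $T(o_k) = A\,\mathrm{diag}(B_{k\cdot})$, which matches the verbal description of the $(i,j)$th component. The function names and the 2-state example model are hypothetical.

```python
import numpy as np
from itertools import product

def transition_operator(A, B, k):
    """T(o_k): entry (i, j) is A[i, j] * B[k, j]."""
    return A @ np.diag(B[k, :])

def string_operator(A, B, x):
    """T(x) for a tuple x of output indices; the identity for the empty string."""
    T = np.eye(A.shape[0])
    for k in x:                       # later outputs multiply on the left
        T = transition_operator(A, B, k) @ T
    return T

def suffix_matrix(A, B, x):
    """S(x) = B T(x); entry (i, j) weights 'emit x, then o_i' when starting in s_j."""
    return B @ string_operator(A, B, x)

def string_probability(A, B, prior, x):
    """Mass remaining after emitting x from the (pseudo-)prior `prior`."""
    return float(np.sum(string_operator(A, B, x) @ prior))

def total_mass(A, B, prior, length):
    """Sum of string_probability over all output strings of the given length."""
    k = B.shape[0]
    return sum(string_probability(A, B, prior, x)
               for x in product(range(k), repeat=length))

# Hypothetical 2-state, 2-output model used only for illustration.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
B = np.array([[0.5, 0.1],
              [0.5, 0.9]])
prior = np.array([1.0, 0.0])
```

Because the columns of $A$ and $B$ sum to one, the total mass over all strings of a fixed length stays equal to one for any pseudo-stochastic prior. This does not by itself make the prior valid, since individual string "probabilities" may still be negative.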

3.1.3 Properties of GMMs 

The most important observation to make about the properties of Generalized Markov
Models is that all the equivalence results of the previous chapter carry over with only
minor modifications. In this section we will describe these modifications. First of
all, we define equivalence of GMMs and Initialized GMMs in exactly the same terms
as for HMMs. Priors are equivalent if they induce the same stochastic process on a
model, and initialized models are equivalent if they represent the same stochastic
process. The essential difference is just that we will allow pseudo-stochastic priors
and transition matrices. Then, Theorems 2.1 and 2.2 concerning equivalence of prior
distributions on HMMs apply immediately to equivalence of pseudo-priors on GMMs.
We can see this is the case because the proofs of these theorems rely only on the linear
structure of the model and do not depend on any property related to stochasticity.
Consequently, the characterization of equivalent Initialized HMMs applies at once
to Initialized GMMs also. At first sight, it appears to be a little more difficult
to translate the theorems concerning equivalence of HMMs into the GMM context,
because they appear to require various quantities to be stochastic. However, a more
careful examination shows they only depend on the fact that stochastic vectors sum
to one. The positivity of probabilities is not used anywhere. We will use this to state
the following lemmas concerning equivalent GMMs. We will only sketch the proofs
since they parallel those of Chapter 2 with minor modifications that the reader can
easily see. As before, we will say that $\mathcal{M}_2 \subseteq \mathcal{M}_1$ if every stochastic process that can
be generated by setting a pseudo-prior on $\mathcal{M}_2$ can also be generated by $\mathcal{M}_1$.

Lemma 3.1 Transformation of Pseudo-priors on GMMs
If $\mathcal{M}_1 = (\mathcal{S}_1, \mathcal{O}, A_1, B_1)$ and $\mathcal{M}_2 = (\mathcal{S}_2, \mathcal{O}, A_2, B_2)$ are GMMs, then
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and only if there exists a pseudo-stochastic matrix $C$ such that we can write
$B_1 T_1(x) C = B_2 T_2(x)$ for every $x \in \mathcal{O}^* \cup \{\epsilon\}$. Furthermore, suppose we know that
$C$ is a pseudo-stochastic matrix that transforms the stochastic priors $p$ on $\mathcal{M}_2$ into
equivalent valid priors $q$ on $\mathcal{M}_1$. Then $C$ transforms every valid prior on $\mathcal{M}_2$ into
an equivalent valid prior on $\mathcal{M}_1$, so that $\mathcal{M}_2 \subseteq \mathcal{M}_1$.

Proof: The proof of the first part of Lemma 3.1 follows the proof of Lemma 2.1.
Essentially, we consider pseudo-priors $e_2(i)$ with all the mass on a state $s_i$ of $\mathcal{M}_2$.
By the assumption that $\mathcal{M}_2 \subseteq \mathcal{M}_1$, there are equivalent pseudo-priors $p_1(i)$ on $\mathcal{M}_1$. The
$p_1(i)$ are necessarily valid for $\mathcal{M}_1$ because they induce valid stochastic processes by
assumption. The columns of the transformation matrix $C$, as in Lemma 2.1, will be
set equal to the $p_1(i)$. The proof then exactly parallels that of Lemma 2.1. To prove
the second part of the lemma, suppose that $C$ transforms stochastic priors on $\mathcal{M}_2$ into
equivalent valid priors on $\mathcal{M}_1$. Then, it transforms the $e_2(i)$ into pseudo-stochastic
$p_{1i} = C e_2(i)$ such that $\Pr(x \mid \mathcal{M}_2, e_2(i)) = \Pr(x \mid \mathcal{M}_1, p_{1i})$ for every string $x$. Next,
observe that every valid prior $p_2$ on $\mathcal{M}_2$ can be written as a linear combination of the
stochastic unit priors $e_2(i)$: $p_2 = \sum_{i=1}^{n_2} a_i e_2(i)$. Consequently, we can write for every
$x \in \mathcal{O}^* \cup \{\epsilon\}$ that:

\begin{align}
\Pr(x \mid \mathcal{M}_2, p_2) &= \sum_{i=1}^{n_2} a_i \Pr(x \mid \mathcal{M}_2, e_2(i)) \notag \\
&= \sum_{i=1}^{n_2} a_i \Pr(x \mid \mathcal{M}_1, C e_2(i)) \notag \\
&= \Pr\Big(x \,\Big|\, \mathcal{M}_1, \sum_{i=1}^{n_2} a_i C e_2(i)\Big) \notag \\
&= \Pr(x \mid \mathcal{M}_1, C p_2) \tag{3.1}
\end{align}

This indicates that it is always true that $\mathcal{M}_2(p_2) \Leftrightarrow \mathcal{M}_1(C p_2)$ so long as $p_2$ is a valid
prior for $\mathcal{M}_2$. Since we only assumed that $C$ correctly transformed stochastic priors,
this proves the second part of the lemma. □

The second part of the lemma essentially says that we get equivalence of GMMs
for free if we can prove that the stochastic priors on a pair of machines can be
transformed into equivalent pseudo-priors on each other. A corollary of this is that
equivalent HMMs are also equivalent GMMs. This is true because we know that if
$\mathcal{M}_2$ and $\mathcal{M}_1$ are HMMs, and $\mathcal{M}_2 \subseteq \mathcal{M}_1$, then we can transform stochastic priors on
$\mathcal{M}_2$ into equivalent priors on $\mathcal{M}_1$ using the transformation matrix $C$ of Lemma 2.1.
Therefore, Lemma 3.1 tells us that $C$ also transforms all valid priors on $\mathcal{M}_2$ into valid
priors on $\mathcal{M}_1$, implying that $\mathcal{M}_2 \subseteq \mathcal{M}_1$ even when the models are treated as GMMs.

Finally, we turn our attention to Lemma 2.2 and Theorem 2.3, which proved necessary
and sufficient conditions for the equivalence of HMMs. Using Lemma 3.1,
and our earlier discussion of the suffix matrix for GMMs, we can see that these
results can be applied directly in the GMM context. We would simply need to require
that the transformation matrix $C$ they invoked be pseudo-stochastic instead
of stochastic. Having convinced ourselves that all the results characterizing equivalence
of HMMs carry over to GMMs, we see that the algorithms developed in
Chapter 2 can be applied to GMMs also. We only need to modify the algorithm
for checking equivalence of un-initialized HMMs by relaxing the stochasticity requirement
on the transformation matrix $C$ that it solves for. This actually makes the
algorithm more efficient since we now only need to solve a system of linear equalities
rather than inequalities. (We no longer need the constraint that the entries of
$C$ should be non-negative.) Standard methods for solving systems of linear equalities
run in $O(m^3 + m^2 n)$ time where $n$ is the number of variables and $m$ is the number
of equations [press90]. Repeating the analysis of the algorithm for determining equivalence
of HMMs, we find that, in the worst case, we will need to solve $n_1 n_2$ equations
in $n_1 n_2$ variables, subject to $n_2$ pseudo-stochasticity constraints. This would take
time $O[(n_1 n_2)^3 + (n_1 n_2)^2 n_1 n_2] = O[(n_1 n_2)^3]$. We conclude that our algorithm for
deciding $\mathcal{M}_2 \subseteq \mathcal{M}_1$, where $\mathcal{M}_2$ and $\mathcal{M}_1$ are GMMs, has a worst-case running time of
$O[n_1 k (n_1^2 + n_2^2 + 2 n_1 n_2) + (n_1 n_2)^3]$. This is somewhat better than the running time
achieved in the context of HMMs. As before, $\mathcal{M}_1 \Leftrightarrow \mathcal{M}_2$ is decided by checking that
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ and $\mathcal{M}_1 \subseteq \mathcal{M}_2$.
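
To make the "linear equalities only" point concrete, here is a rough Python sketch that searches for a pseudo-stochastic $C$ with $B_1 T_1(x) C = B_2 T_2(x)$ by least squares. It is deliberately simplified and is not the algorithm of Chapter 2: instead of building a basis of suffix-matrix rows, it simply enumerates all strings up to a fixed length `max_len`, which is an assumption made for illustration and gives no guarantee for longer strings. The function names, the tolerance, and the three example models are hypothetical; the conventions ($B$ is $k \times n$, $T(o_k) = A\,\mathrm{diag}(B_{k\cdot})$) are the same assumptions as in the text's description of GMM evolution.

```python
import numpy as np
from itertools import product

def stacked_suffix_rows(A, B, max_len):
    """Stack S(x) = B T(x) for every output string x with len(x) <= max_len."""
    k = B.shape[0]
    ops = [A @ np.diag(B[o, :]) for o in range(k)]
    blocks = []
    for length in range(max_len + 1):
        for x in product(range(k), repeat=length):
            T = np.eye(A.shape[0])
            for o in x:
                T = ops[o] @ T
            blocks.append(B @ T)
    return np.vstack(blocks)

def find_transformation(A1, B1, A2, B2, max_len=3, tol=1e-9):
    """Return a pseudo-stochastic C with S1(x) C = S2(x) for all short x, or None."""
    S1 = stacked_suffix_rows(A1, B1, max_len)
    S2 = stacked_suffix_rows(A2, B2, max_len)
    # Append the column-sum constraint 1^T C = 1^T as one extra equation row.
    lhs = np.vstack([S1, np.ones((1, A1.shape[0]))])
    rhs = np.vstack([S2, np.ones((1, A2.shape[0]))])
    C, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return C if np.allclose(lhs @ C, rhs, atol=tol) else None

# Hypothetical models: a 3-state model, a 2-state model that generates a subset
# of its processes, and an unrelated 2-state model that does not.
A1 = np.array([[0.4, 0.32, 0.3], [0.2, 0.36, 0.4], [0.4, 0.32, 0.3]])
B1 = np.array([[0.7, 0.5, 0.3], [0.3, 0.5, 0.7]])
A2 = np.array([[1.0, 1.0], [0.0, 0.0]])
B2 = np.array([[0.5, 0.3], [0.5, 0.7]])
A3 = np.array([[0.9, 0.2], [0.1, 0.8]])
B3 = np.array([[0.5, 0.1], [0.5, 0.9]])

C = find_transformation(A1, B1, A2, B2)
```

Note that the solution need not be unique, and `lstsq` returns the minimum-norm one, so the entries of the returned `C` may differ from a hand-constructed transformation while still satisfying the same equations.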



Our discussions of Generalized Markov Models have swept an important issue 
under a definitional rug. Our formulation of GMMs is not satisfactory since it does 
not characterize what makes a given pseudo-stochastic vector valid for a given model. 
Consequently, the definition is not clear about exactly what forms the transition 
matrix A is allowed to take. Since this thesis only compares GMMs with each other, 
this does not become a difficulty for us - we will always work with models that are 
presumed to be well-defined. (Obviously, some such models exist since HMMs are 
themselves GMMs with priors restricted to be stochastic.) However, if we want to 
build GMMs for practical applications we must have a more constructive method 
of evaluating the validity of pseudo-stochastic vectors for a given model. At least 
partly because of the non-constructive definition of GMMs, we have not discussed 
the issue of parameter-estimation and training of these models from data. However, 
even without properly understanding the nature of valid vectors for GMMs, we can 
make progress towards developing training algorithms. Some relevant ideas will be 
presented in the next chapter. 
3.2 Canonical Dimensions and Forms

We will now define the canonical dimension of a GMM. This will be a measure that 
characterizes the essential degree of freedom available in the model. As described
above, we will freely borrow from the notation defined in Chapters 1 and 2 to
manipulate HMMs. Distributions over the outputs of the model will remain stochastic.
However, "distributions" over the states of the model will be pseudo-stochastic. We 
will now use the suffix matrix (Definition 2.5) to define the canonical dimension of a 
Generalized Markov Model. 

Definition 3.3 Canonical Dimension 

Let $\mathcal{M}$ be a Generalized Markov Model with suffix matrices $S(x)$ for every
$x \in \mathcal{O}^* \cup \{\epsilon\}$ as in Definition 2.5. Also let $U = \{\sigma(y) : y \in \mathcal{O}^*\}$ be the set of all
rows of the suffix matrices of $\mathcal{M}$ as in Lemma 2.2. We define the canonical dimension of
$\mathcal{M}$ ($d_{\mathcal{M}}$) to be the dimension of the space spanned by the vectors in $U$. In other words,
$d_{\mathcal{M}} = \dim[\mathrm{Span}(U)]$.

In order to understand the meaning of the canonical dimension of a model, remember
that if $\sigma(x) \in U$, then the $j$th component of $\sigma(x)$ is the probability that the model
starts in state $s_j$, and emits the string $x$. So, in some sense, the canonical dimension
of a model captures the maximal degree of freedom we have to define different
stochastic processes by setting up different valid prior distributions. Our definition
is also motivated by the following easy result that equivalent GMMs must have the
same canonical dimension.
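
One way to compute a canonical dimension numerically is sketched below. The sketch rests on assumptions that should be checked against Chapters 1 and 2: that $T(o_k) = A\,\mathrm{diag}(B_{k\cdot})$ with $B$ stored as a $k \times n$ array, and that every row of every suffix matrix can be reached from the rows of $B$ by repeated right-multiplication with the operators $T(o_k)$, so that $\mathrm{Span}(U)$ is the smallest subspace that contains the rows of $B$ and is closed under those multiplications. The function name, the rank tolerance `tol`, and the example models are hypothetical.

```python
import numpy as np

def canonical_dimension(A, B, tol=1e-10):
    """Dimension of the span of all suffix-matrix rows, found by span closure."""
    ops = [A @ np.diag(B[k, :]) for k in range(B.shape[0])]
    basis = []            # linearly independent row vectors found so far
    queue = list(B)       # rows of S(empty string) = B
    while queue:
        w = queue.pop()
        candidate = np.vstack(basis + [w])
        if np.linalg.matrix_rank(candidate, tol=tol) > len(basis):
            basis.append(w)
            # Only new directions can generate further new directions.
            queue.extend(w @ T for T in ops)
    return len(basis)

# Hypothetical examples.
A_three = np.array([[0.4, 0.32, 0.3], [0.2, 0.36, 0.4], [0.4, 0.32, 0.3]])
B_three = np.array([[0.7, 0.5, 0.3], [0.3, 0.5, 0.7]])      # first column = 2*second - third

A_copy = np.array([[0.5, 0.5], [0.5, 0.5]])
B_copy = np.array([[0.3, 0.3], [0.7, 0.7]])                 # two indistinguishable states

A_two = np.array([[0.9, 0.2], [0.1, 0.8]])
B_two = np.array([[0.5, 0.1], [0.5, 0.9]])
```

The loop adds at most $n$ basis vectors, so it terminates after at most $nk$ extra candidates have been examined. In the examples, the 3-state model has canonical dimension 2 and the model with two indistinguishable states has canonical dimension 1.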

Theorem 3.1 Invariance of Canonical Dimensions 

Let $\mathcal{M}_1$ be a GMM with $n_1$ states and canonical dimension $d_1$. Let $\mathcal{M}_2$ be any GMM
with $n_2$ states that is equivalent to $\mathcal{M}_1$. Let $d_2$ denote the canonical dimension of
$\mathcal{M}_2$. Then it must be the case that $d_2 = d_1$ and $n_2 \ge d_1$.
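Proof: If $\mathcal{M}_1 \Leftrightarrow \mathcal{M}_2$, then $\mathcal{M}_1 \subseteq \mathcal{M}_2$ and $\mathcal{M}_2 \subseteq \mathcal{M}_1$. Suppose then that
$\mathcal{M}_2 \subseteq \mathcal{M}_1$. Then, using Lemma 3.1, we can write:

$$\forall x \in \mathcal{O}^* \cup \{\epsilon\}: \quad \sigma_1(x) C = \sigma_2(x) \tag{3.2}$$

But we can expand $\sigma_1(x)$ in terms of a basis $\{\sigma_1(x_i)\}$ for the span of
$U_1 = \{\sigma_1(x)\}$, to write:

\begin{align}
\sigma_1(x) C &= \sum_{i=1}^{d_1} b_i \sigma_1(x_i) C \tag{3.3} \\
&= \sum_{i=1}^{d_1} b_i \sigma_2(x_i) \tag{3.4} \\
\Rightarrow \quad \sigma_2(x) &= \sum_{i=1}^{d_1} b_i \sigma_2(x_i) \tag{3.5}
\end{align}

Equation 3.5 shows that the collection of vectors $\{\sigma_2(x_i)\}$ spans the span
of $U_2$, so that $d_2 \le |\{\sigma_2(x_i)\}| = |\{\sigma_1(x_i)\}| = d_1$. Similarly, since $\mathcal{M}_1 \subseteq \mathcal{M}_2$ also,
we can say that $d_1 \le d_2$, giving us the result that $d_1 = d_2$. Finally, notice that the
canonical dimension of a model $\mathcal{M}$ with $n$ states must be less than or equal to $n$,
since the $\sigma(x)$ vectors for $\mathcal{M}$ will have only $n$ components. So, if $\mathcal{M}_2$ is equivalent to
$\mathcal{M}_1$, $n_2 \ge d_2 = d_1$. □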

Theorem 3.1 tells us that we cannot build a GMM equivalent to $\mathcal{M}_1$ with fewer
than $d_1$ states. Next we want to show that if $\mathcal{M}_1$ has canonical dimension $d_1$ and $n_1$
states, where $n_1 > d_1$, then we can effectively construct an equivalent model $\mathcal{M}'$ with
only $d_1$ states. We will prove this by first demonstrating how a particular special
type of GMM can be reduced. We will then reduce every GMM to this special form,
thereby proving the desired result.

Lemma 3.2 Reduction of a Special Form
Let $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ be a GMM with $n$ states. Let $X$ be the largest subspace of the
null-space of $B$ that is invariant under each of the transition operators $T(o_k)$. Also
let $B_i$ and $T_i(x)$ denote the $i$th columns of $B$ and $T(x)$ respectively. Suppose that there
is a collection of coefficients $\{f_{jl}\}$ and an index $a$, $1 \le a < n$, such that:

$$\forall l \le a: \quad B_l = \sum_{j=a+1}^{n} f_{jl} B_j \tag{3.6}$$

$$\forall o_k \in \mathcal{O} \text{ and } \forall l \le a: \quad T_l(o_k) = \sum_{j=a+1}^{n} f_{jl} T_j(o_k) + \Delta_l(o_k) \tag{3.7}$$

where $\Delta_l(o_k) \in X$. We will call the states $\{s_1, s_2, \ldots, s_a\}$ the dependent states of
$\mathcal{M}$, and $\{s_{a+1}, s_{a+2}, \ldots, s_n\}$ the independent states of $\mathcal{M}$. We can build a model
$\mathcal{M}' = (\mathcal{S}', \mathcal{O}, A', B')$ with $n' = n - a$ states, such that $\mathcal{M}' \Leftrightarrow \mathcal{M}$, and $\mathcal{S}'$ contains
only the independent states of $\mathcal{M}$.

Prior to proving the lemma it will help to have some intuitions for why it should
be true. The lemma basically says that a model can be reduced to a smaller size
if the output distributions are linearly dependent and the corresponding columns of
every $T(o_k)$ are dependent with the same coefficients. The basic idea of the proof
is to realize that passing through one of the states $s_l$ for $l \le a$ is indistinguishable
from passing through the states $s_m$ for $m > a$ with pseudo-probabilities weighted
according to the appropriate linear dependency coefficients.$^3$ (See Figure 3-1.) We can
use this observation to redistribute the priors and the outgoing probabilities from each
state in such a way that the linearly dependent states are never visited and can be
thrown away. The proof below is simply a formalization of this idea.

Proof: In the following discussion we will adopt the convention that variables
indexing the states of $\mathcal{M}'$ will range over $a + 1$ to $n$. Our proof will proceed in five
steps. First we will define $B'$ and $A'$. In the second step we will prove a useful
invariance property of $A'$. Next we will define a pseudo-stochastic transformation of
the priors on $\mathcal{M}$ into priors on $\mathcal{M}'$. Then we will use the invariance property of $A'$ to
show that $\mathcal{M} \subseteq \mathcal{M}'$. Finally, we will demonstrate that $\mathcal{M}' \subseteq \mathcal{M}$. We will find it
convenient to define the following matrix:

3 This is true up to the vector $\Delta_l(o_k)$. However, $\Delta_l(o_k)$ lies in an invariant subspace of the
null-space of $B$. Consequently, it never contributes to distributions over the outputs, and can be
ignored.

[Figure: state-transition diagram; the legible edge labels are $A_{33} + f_3 A_{31}$,
$A_{22} + f_2 A_{21}$ and $A_{11}$.]

The figure shows an HMM for which $B_1 = f_2 B_2 + f_3 B_3$ and the $T(o_k)$ satisfy Equation 3.7.
(We have suppressed the output distributions in the figure.) In order to
remove the dependent state $s_1$, we excise the transitions to $s_1$ and add them to the
transitions between the independent states weighted appropriately by $f_2$ and $f_3$. The
priors are redistributed in the same way. If we do this, observe that $s_1$ is never visited
and can be thrown away.

Figure 3-1: Reduction of A Special Form



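$$F = \begin{bmatrix}
f_{(a+1)1} & f_{(a+1)2} & \cdots & f_{(a+1)a} \\
f_{(a+2)1} & f_{(a+2)2} & \cdots & f_{(a+2)a} \\
\vdots & \vdots & & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{na}
\end{bmatrix} \tag{3.8}$$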
So $F$ is an $n' \times a$ matrix whose components are the expansion coefficients assumed
in the lemma. Note that $F$ must be pseudo-stochastic, since all the vectors on both
sides of Equation 3.6 are stochastic and therefore sum to one. We are now ready to
construct the reduced model $\mathcal{M}'$.

First of all, we will take the new output matrix $B'$ to simply be the last $n' = n - a$
columns of $B$. Our earlier intuitions concerning $A'$ said that the transitions to
dependent states should be redistributed according to the weights of the expansion
coefficients. Putting this idea into symbols gives:

$$A'_{ij} = A_{ij} + \sum_{l=1}^{a} f_{il} A_{lj} \tag{3.9}$$

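We can use the $F$ matrix defined earlier to compactly write down the relationship
between $A$, $A'$, $B$ and $B'$:

$$A' = [F \mid I_{n' \times n'}] \, A \begin{bmatrix} 0_{a \times n'} \\ I_{n' \times n'} \end{bmatrix} \tag{3.10}$$

$$B = B' \, [F \mid I_{n' \times n'}] \tag{3.11}$$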

$I_{n' \times n'}$ is the $n'$ by $n'$ identity matrix and $[F \mid I_{n' \times n'}]$ is the matrix consisting of $F$ and
$I_{n' \times n'}$ concatenated together. $0_{a \times n'}$ is the $a \times n'$ zero matrix. Now suppose that
$P(t, x_{t-1})$ is a vector such that $P_i(t, x_{t-1})$ is the pseudo-probability that the model
$\mathcal{M}$ emits the string $x_{t-1}$ and then enters the state $s_i$ at time $t$. (This is the
pseudo-distribution over states before seeing the output at time $t$.) Then suppose it is also
true that:

$$P'(t, x_{t-1}) = [F \mid I_{n' \times n'}] \, (P(t, x_{t-1}) + \delta) \tag{3.12}$$

where $\delta$ is a vector that lies in $X$. We claim that if Equation 3.12 holds, then the
joint probability of the output at time $t$ and $x_{t-1}$ is the same for $\mathcal{M}$ and $\mathcal{M}'$.
Furthermore, regardless of the output at time $t$ it will be true that
$P'(t+1, x_t) = [F \mid I_{n' \times n'}] \, (P(t+1, x_t) + \delta')$, where $\delta'$ is a vector lying in $X$, the
invariant subspace of the null-space of $B$. We can prove the first part of the claim by
observing that:

\begin{align}
B' P'(t, x_{t-1}) &= B' [F \mid I_{n' \times n'}] \, (P(t, x_{t-1}) + \delta) \notag \\
&= B \, (P(t, x_{t-1}) + \delta) \notag \\
&= B P(t, x_{t-1}) \tag{3.13}
\end{align}

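The last equation follows because $\delta$ is in the null-space of $B$. In order to prove
the second part of the claim we assume without loss of generality that $o(t) = o_k$
and evolve the model forward in time. In order to do this, note that the transition
operator $T'(o_k)$ can be written as:

\begin{align}
T'(o_k) &= A' B'_k \tag{3.14} \\
&= [F \mid I_{n' \times n'}] \, A \begin{bmatrix} 0_{a \times n'} \\ I_{n' \times n'} \end{bmatrix}
[0_{n' \times a} \mid I_{n' \times n'}] \, B_k \begin{bmatrix} 0_{a \times n'} \\ I_{n' \times n'} \end{bmatrix} \tag{3.15} \\
&= [F \mid I_{n' \times n'}] \, A \begin{bmatrix} 0_{a \times a} & 0_{a \times n'} \\ 0_{n' \times a} & I_{n' \times n'} \end{bmatrix}
B_k \begin{bmatrix} 0_{a \times n'} \\ I_{n' \times n'} \end{bmatrix} \tag{3.16}
\end{align}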

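where we have used the fact that $B'_k$ consists of the last $n'$ rows and columns of $B_k$.
We can simplify this a little further by using the notation $A_i$ = $i$th column of $A$ to
write:

\begin{align}
A \begin{bmatrix} 0_{a \times a} & 0_{a \times n'} \\ 0_{n' \times a} & I_{n' \times n'} \end{bmatrix} B_k
&= [0_{n \times a} \mid A_{a+1} \mid A_{a+2} \mid \cdots \mid A_n] \, B_k \tag{3.17} \\
&= [0_{n \times a} \mid T_{a+1}(o_k) \mid T_{a+2}(o_k) \mid \cdots \mid T_n(o_k)] \tag{3.18}
\end{align}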

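Using this we can conveniently compute $P'(t+1, x_t)$ as shown below. We will let $\delta$ and
$\delta'$ denote vectors in $X$, the invariant subspace of the null-space of $B$. For compactness
of the equations we will also write $T^a(o_k)$ for
$[0_{n \times a} \mid T_{a+1}(o_k) \mid T_{a+2}(o_k) \mid \cdots \mid T_n(o_k)]$.

\begin{align}
P'(t+1, x_t) &= T'(o_k) \, P'(t, x_{t-1}) \tag{3.19} \\
&= [F \mid I_{n' \times n'}] \, T^a(o_k) \begin{bmatrix} 0_{a \times n'} \\ I_{n' \times n'} \end{bmatrix}
[F \mid I_{n' \times n'}] \, (P(t, x_{t-1}) + \delta) \tag{3.20} \\
&= [F \mid I_{n' \times n'}] \, T^a(o_k) \begin{bmatrix} 0_{a \times a} & 0_{a \times n'} \\ F & I_{n' \times n'} \end{bmatrix}
(P(t, x_{t-1}) + \delta) \tag{3.21}
\end{align}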

We can now use the fact that the columns of $T(o_k)$ are linearly dependent according
to Equation 3.7 to write:

\begin{align}
T^a(o_k) \begin{bmatrix} 0_{a \times a} & 0_{a \times n'} \\ F & I_{n' \times n'} \end{bmatrix}
&= [0_{n \times a} \mid T_{a+1}(o_k) \mid T_{a+2}(o_k) \mid \cdots \mid T_n(o_k)]
\begin{bmatrix} 0_{a \times a} & 0_{a \times n'} \\ F & I_{n' \times n'} \end{bmatrix} \notag \\
&= T(o_k) - [\Delta_1(o_k) \mid \Delta_2(o_k) \mid \cdots \mid \Delta_a(o_k) \mid 0_{n \times n'}] \tag{3.22} \\
&= T(o_k) + \Delta \tag{3.23}
\end{align}

where we have set $\Delta = -[\Delta_1(o_k) \mid \Delta_2(o_k) \mid \cdots \mid \Delta_a(o_k) \mid 0_{n \times n'}]$. Observe that for
any vector $x$ of appropriate dimension, $\Delta x \in X$ since every column of $\Delta$ is an element of $X$,
the invariant null-space of $B$. Therefore, plugging Equation 3.23 into Equation 3.21, we
find that:

\begin{align}
P'(t+1, x_t) &= [F \mid I_{n' \times n'}] \, [T(o_k) P(t, x_{t-1}) + \delta'] \tag{3.24} \\
&= [F \mid I_{n' \times n'}] \, (P(t+1, x_t) + \delta') \tag{3.25}
\end{align}

where $\delta'$ is some vector in $X$.$^4$ Equation 3.25 shows us that if the pseudo-probabilities
on $\mathcal{M}'$ satisfy Equation 3.12 at time $t$, they do so also at time $t + 1$ and, by induction
on $t$, for all future times. This invariance property of $A'$ will be useful shortly in
proving that $\mathcal{M}'$ and $\mathcal{M}$ are equivalent.

4 We get Equation 3.24 by using the facts that $T(o_k)\delta \in X$ since $\delta \in X$, and $\Delta x \in X$ for
any $x$ as discussed before.

We are finally in a position to show that $\mathcal{M} \subseteq \mathcal{M}'$. If $p$ is a prior on $\mathcal{M}$,
let $p' = [F \mid I_{n' \times n'}] \, p$ be the corresponding prior on $\mathcal{M}'$. These prior distributions
cause Equation 3.12 to be satisfied for $t = 0$ and $x = \epsilon$. Therefore, by our earlier
discussion, Equation 3.12 is satisfied for all times $t$ and strings $x_{t-1}$. We also showed
that if Equation 3.12 is satisfied, then the two models have the same probabilities of
producing the various outputs. Hence, we can conclude that $\mathcal{M}'(p') \Leftrightarrow \mathcal{M}(p)$. Since
$[F \mid I_{n' \times n'}]$ is a pseudo-stochastic transformation of priors on $\mathcal{M}$ into equivalent priors
on $\mathcal{M}'$, we know that $\mathcal{M} \subseteq \mathcal{M}'$. To show that $\mathcal{M}' \subseteq \mathcal{M}$, we will first show that
every stochastic prior on $\mathcal{M}'$ can be transformed into an equivalent valid prior on $\mathcal{M}$.
Lemma 3.1 will then show that $\mathcal{M}' \subseteq \mathcal{M}$. So suppose that $q'$ is a stochastic prior on
$\mathcal{M}'$. Then, construct a prior $q$ on $\mathcal{M}$ such that:

$$q = (0_a, q') \tag{3.26}$$

where $0_a$ is the $a$-dimensional zero vector. We can see at once that $q' = [F \mid I_{n' \times n'}] \, q$.
Therefore, our earlier discussion shows that $\mathcal{M}'(q') \Leftrightarrow \mathcal{M}(q)$. So Equation 3.26
defines a pseudo-stochastic transformation of stochastic priors on $\mathcal{M}'$ into equivalent
valid priors on $\mathcal{M}$. By Lemma 3.1 we can conclude that $\mathcal{M}' \subseteq \mathcal{M}$. Putting everything
together we finally reach the desired conclusion that $\mathcal{M}' \Leftrightarrow \mathcal{M}$. □

All that remains in our quest to find minimal representations for GMMs is a 
way of transforming all reducible GMMs into the special form that was reduced in 
Lemma 3.2. We will now prove a theorem that shows that all reducible GMMs are 
already in the special form of Lemma 3.2. This is then used to reduce GMMs to their
minimal equivalent representations. 

Theorem 3.2 Reduction of GMMs to Minimal Representations 
Let $\mathcal{M}_1 = (S_1, O, A_1, B_1)$ be a GMM with $n_1$ states and canonical dimension $d_1 < n_1$.






Then $\mathcal{M}_1$ can be reduced to a minimal equivalent model $\mathcal{M}^*$ with only $d_1$ states. If a
model has only as many states as its canonical dimension, we will call it a minimal
representation for its equivalence class.



Proof: We defined the canonical dimension of $\mathcal{M}_1$ to be the dimension of the span
of $U_1 = \{\sigma_1(y) : y \in O^*\}$, where the $\sigma_1(y)$ are rows of the suffix matrices of $\mathcal{M}_1$. Let
$V = \{\sigma_1(x_1), \sigma_1(x_2), \ldots, \sigma_1(x_{d_1})\}$ be a collection of vectors in $U_1$ that forms a basis
for $\mathrm{Span}(U_1)$. Then consider a matrix $G$ whose rows are the elements of $V$. We can
write:

\begin{equation}
G = \begin{bmatrix} \sigma_1(x_1) \\ \sigma_1(x_2) \\ \vdots \\ \sigma_1(x_{d_1}) \end{bmatrix}
  = [\, g_1 \mid g_2 \mid \cdots \mid g_{n_1} \,] \tag{3.27}
\end{equation}

(In this equation the vectors $g_i$ represent the columns of $G$.) $G$ is a $d_1 \times n_1$ matrix
whose rows are linearly independent. So it has row rank $d_1$, and this means that
its column rank is also $d_1$. So there are only $d_1$ independent columns in $G$. Assume,
without loss of generality, that the last $d_1$ columns of $G$ are the independent columns
and let $a = n_1 - d_1$. There must be a set of coefficients $\{f_{ji}\}$ such that we can write:

\begin{equation}
\forall\, i \le a : \quad g_i = \sum_{j=a+1}^{n_1} f_{ji}\, g_j \tag{3.28}
\end{equation}



We are going to use this fact to show that $\mathcal{M}_1$ already satisfies the conditions of Lemma 3.2
and can therefore be reduced to a smaller size. In order to do this we will find it
convenient to introduce the following matrix:

\begin{equation}
F = \begin{bmatrix}
f_{(a+1)1} & f_{(a+1)2} & \cdots & f_{(a+1)a} \\
f_{(a+2)1} & f_{(a+2)2} & \cdots & f_{(a+2)a} \\
\vdots & \vdots & & \vdots \\
f_{n_1 1} & f_{n_1 2} & \cdots & f_{n_1 a}
\end{bmatrix} \tag{3.29}
\end{equation}

This matrix is formally the same as the $F$ matrix used in the proof of the special case
reduction lemma. We will see that the similarity is not coincidental. We can use the
$F$ matrix to rewrite Equation 3.28 more compactly as follows:

\begin{equation}
G \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix} = 0 \tag{3.30}
\end{equation}

Remember now that every row of every suffix matrix $\Sigma_1(x)$ can be written as a linear
combination of the rows of $G$. This implies that corresponding to every matrix $\Sigma_1(x)$,
there is another matrix $S(x)$ such that $\Sigma_1(x) = S(x)G$. (The $i$th row of $S(x)$ contains
the coefficients expressing the $i$th row of $\Sigma_1(x)$ as a linear combination of the rows of
$G$.) Using this we find that:

\begin{equation}
\forall\, x \in O^* \cup \{\epsilon\} : \quad
\Sigma_1(x) \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix}
= S(x)\, G \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix} = 0 \tag{3.31}
\end{equation}



By picking $x = \epsilon$ so that $\Sigma_1(x) = B_1$, and expanding the matrix notation into a
summation, we find that:

\begin{equation}
\forall\, i \le a : \quad B_i = \sum_{j=a+1}^{n_1} f_{ji}\, B_j \tag{3.32}
\end{equation}

where $B_i$ is the $i$th column of $B_1$. Notice that this is exactly the first condition
we need in order to apply our earlier lemma on reduction of certain special types of
GMMs. Next, for notational convenience, we define $\Delta(o_k)$ such that:

\begin{equation}
\Delta(o_k) = T_1(o_k) \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix} \tag{3.33}
\end{equation}



We will refer to the $i$th columns of $\Delta(o_k)$ and $T_1(o_k)$ as $\Delta_i(o_k)$ and $T_i(o_k)$ respectively.
Then, for any string $y = o_k x$ which starts with the output $o_k$ we can write
Equation 3.31 as:

\begin{equation}
\Sigma_1(y) \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix}
= B_1 T_1(x)\, T_1(o_k) \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix}
= B_1 T_1(x)\, \Delta(o_k) = 0 \tag{3.34}
\end{equation}

Since this equality holds for every string $x \in O^* \cup \{\epsilon\}$, we can conclude that the
columns of $\Delta(o_k)$ are elements of $X$, the invariant null-space of $B_1$. By expanding the
definition of $\Delta(o_k)$ we then find that:

\begin{equation}
\forall\, o_k \in O \text{ and } \forall\, i \le a : \quad
T_i(o_k) = \sum_{j=a+1}^{n_1} f_{ji}\, T_j(o_k) + \Delta_i(o_k) \tag{3.35}
\end{equation}

where the $\Delta_i(o_k) \in X$ are the columns of $\Delta(o_k)$. Now Equations 3.32 and 3.35 are
exactly the conditions that make Lemma 3.2 true. Consequently, any GMM with
canonical dimension $d_1$ has only $d_1$ independent states. The method outlined in the
proof of Lemma 3.2 can then be used to reduce $\mathcal{M}_1$ to a model $\mathcal{M}^*$ with only $d_1$
states. Since Theorem 3.1 tells us that no smaller model can be equivalent to $\mathcal{M}_1$,
$\mathcal{M}^*$ is a minimal representation of $\mathcal{M}_1$. □



Theorem 3.2 shows how a GMM can be reduced to a minimal representation. We 
will discuss how this result applies to Hidden Markov Models in Section 3.2.1. In ad- 
dition to finding minimal models, we also want our representations to be "canonical" 
in the sense that they are essentially unique. Next, we will prove two theorems that 
provide a deeper understanding of the essential reasons for reducibility of GMMs, and 
characterize the relationship between equivalent minimal representations of a given 
model $\mathcal{M}$.

Theorem 3.3 Geometric Characterization of Minimal Representations
As before, we will call a model a minimal representation if it is the smallest
model in its equivalence class. A model is minimal if and only if its invariant
null-space $X$ consists of only the zero vector. Hence, priors are equivalent for a minimal
representation only if they are equal to each other.

Proof: Let $\mathcal{M} = (S, O, A, B)$ be a Generalized Markov Model with $n$ states. We
remind the reader that the invariant null-space $X$ is the largest subspace of the
null-space of the output matrix $B$ that is invariant under the action of every transition
operator $T(o_k)$. Suppose, first of all, that $\mathcal{M}$ is a minimal representation. Suppose
also that there is a vector $\delta \in X$ which has some non-zero components. By definition
of being an element of $X$ we can write:

\begin{equation}
\forall\, x \in O^* \cup \{\epsilon\} : \quad B\,T(x)\,\delta = \Sigma(x)\,\delta = 0 \tag{3.36}
\end{equation}

By picking $x = \epsilon$ and $x = o_k y$, where $y$ is any string, we can write:

\begin{align}
B\,\delta &= 0 \tag{3.37} \\
\forall\, y \in O^* \cup \{\epsilon\} : \quad B\,T(y)\,[\,T(o_k)\,\delta\,] &= B\,T(y)\,\Delta(o_k) = 0 \tag{3.38}
\end{align}

where we have written $\Delta(o_k)$ for $T(o_k)\delta$. The second equation says that $\Delta(o_k) \in X$,
the invariant null-space of $B$. Writing this out as an equation for the columns of
$B$ and $T(o_k)$, and assuming, without loss of generality, that $\delta_1 \ne 0$, we find that:

\begin{align}
B_1 &= \sum_{j=2}^{n} -\frac{\delta_j}{\delta_1}\, B_j \tag{3.39} \\
\forall\, o_k \in O : \quad T_1(o_k) &= \sum_{j=2}^{n} -\frac{\delta_j}{\delta_1}\, T_j(o_k) + \frac{1}{\delta_1}\,\Delta(o_k) \tag{3.40}
\end{align}

(As before, we are writing $B_i$ and $T_i(o_k)$ for the $i$th columns of $B$ and $T(o_k)$
respectively.) But this means that $s_1$ is a dependent state, in the sense of Lemma 3.2,
and can be reduced away. This contradicts the assumed minimality of the model.
So we see that if $\mathcal{M}$ is a minimal representation, then $X$ can consist only of the zero
vector. Next we will prove that if $X = \{0\}$, then the model is necessarily minimal. So
assume that $X = \{0\}$. Then suppose that $\mathcal{M}$ is not minimal and therefore $n > d_{\mathcal{M}}$,
where $d_{\mathcal{M}}$ is the canonical dimension of $\mathcal{M}$. Theorem 3.2 then tells us that there is
a collection of coefficients $\{f_{ji}\}$, not all of which are zero, such that:

\begin{equation}
\forall\, x \in O^* \cup \{\epsilon\} : \quad \Sigma(x) \begin{bmatrix} I_{a\times a} \\ -F \end{bmatrix} = 0 \tag{3.41}
\end{equation}

where $F$ is defined by Equation 3.29 (here $a = n - d_{\mathcal{M}}$). Let $\delta$ be any column of the matrix to the right
of $\Sigma(x)$ in Equation 3.41. Then $\delta$ is a vector with some non-zero components that
lies in $X$.⁵ This contradicts our assumption about $X$, telling us that if $X$ consists only
of the zero vector, the model cannot have more states than the canonical dimension,
and is, therefore, minimal. So we have proved that for a model $\mathcal{M}$ to be a minimal
representation, it is necessary and sufficient that its invariant null-space consists only
of the zero vector. Observe that, according to the GMM version of Theorem 2.1, this
implies that equivalent priors on a minimal representation are equal to each other. □

Theorems 3.1 and 3.2 told us that a minimal model has exactly as many states
as its canonical dimension. The result proven just above showed that a minimal model
can be characterized geometrically as having an invariant null-space consisting only of
the zero vector. Furthermore, the invariant null-space of a model with $n$ states has
dimension $n - d_{\mathcal{M}}$, where $d_{\mathcal{M}}$ is the canonical dimension of the model. One consequence
of this is that no two unequal priors on a minimal model are equivalent. In other 
words, equivalence of priors on a minimal model implies equality of priors. This tells 
us that the minimal representation indeed removes every last shred of redundancy 
available in a model. Every stochastic process that can be modelled by setting the 
priors on the machine is represented precisely once, by a distinct prior. We could use 
this to build an algorithm to reduce a model to its minimal representation. First of all, 



⁵ This is so because for every string $x$ we know that $\Sigma(x)\delta = BT(x)\delta = 0$.
we would find the invariant null-space $X$ via standard methods for decomposing vector
spaces based on their invariance properties under different operators. Then we would 
find a basis for X, and use the basis vectors as shown in the proof of Theorem 3.3 as 
the linear dependency coefficients required by the reduction lemma. However, we can 
build a cleaner algorithm directly from Theorem 3.2. We will do this after proving 
one more theorem which characterizes the relationship between equivalent minimal 
representations in the GMM formalism for a class of stochastic processes. 
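
The first step of the alternative procedure just described can be sketched concretely. The block below is only an illustration under stated assumptions, not the algorithm of this thesis: it assumes that a matrix G whose rows span the rows of all suffix matrices is already available (as in Equation 3.27), and it reads Equation 3.36 as saying that the invariant null-space is exactly the set of vectors annihilated by every such row, so that it can be computed as the null-space of G. The function name `null_space` and the small example matrix are hypothetical. Exact rational arithmetic is used so that the test for a zero pivot is not affected by rounding.

```python
from fractions import Fraction


def null_space(G):
    """Return a basis for {v : G v = 0} using exact Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in row] for row in G]
    n = len(rows[0])
    pivots = []  # pivot column of each reduced row, in order
    r = 0        # index of the next row to receive a pivot
    for c in range(n):
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column: it is a free column
        rows[r], rows[p] = rows[p], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        v = [Fraction(0)] * n
        v[free] = Fraction(1)
        for i, pc in enumerate(pivots):
            v[pc] = -rows[i][free]
        basis.append(v)
    return basis


# Hypothetical G for a 3-state model whose first two states are indistinguishable.
G = [[Fraction(1, 2), Fraction(1, 2), Fraction(1, 4)],
     [Fraction(1, 2), Fraction(1, 2), Fraction(3, 4)]]
print(null_space(G))
```

In this example G has rank 2 and three columns, so the null-space has dimension 3 - 2 = 1, in line with the dimension count n minus the canonical dimension given above.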

Theorem 3.4 Relationship Between Minimal Representations 

Suppose $\mathcal{M} = (S, O, A, B)$ and $\mathcal{M}' = (S', O, A', B')$ are two $n$-state GMMs, both of
which are minimal representations of a class of processes with canonical dimension
$d_{\mathcal{M}}$. Then $\mathcal{M}$ and $\mathcal{M}'$ are related by a change of basis for the $n$-dimensional space of
vectors over the states.

Proof: Since $\mathcal{M}$ and $\mathcal{M}'$ are equivalent models, Lemma 3.1 tells us that there are
two pseudo-stochastic matrices $C$ and $C'$ such that:

\begin{align}
\forall\, x \in O^* \cup \{\epsilon\} : \quad B\,T(x)\,C &= B'\,T'(x) \tag{3.42} \\
\forall\, x \in O^* \cup \{\epsilon\} : \quad B'\,T'(x)\,C' &= B\,T(x) \tag{3.43}
\end{align}

Picking $x = \epsilon$, this tells us that $BC = B'$ and $B'C' = B$. Then, substituting
Equation 3.43 back into Equation 3.42, and bringing all terms to the right hand side,
we find that:

\begin{equation}
\forall\, x \in O^* \cup \{\epsilon\} : \quad B'\,T'(x)\,[\,I_{n\times n} - C'C\,] = \Sigma'(x)\,[\,I_{n\times n} - C'C\,] = 0 \tag{3.44}
\end{equation}

This means that the corresponding columns of $I_{n\times n}$ and $C'C$ are equivalent priors for
$\mathcal{M}'$. But we know from Theorem 3.3 that priors on minimal models are equivalent
if and only if they are equal. So we conclude that $C'C = I_{n\times n}$. Similarly, we find
that $CC' = I_{n\times n}$, and so we can say that $C$ and $C'$ are non-singular matrices and are
inverses of each other.

Now define the terms state vector space and output vector space to mean the
vector spaces associated with distributions over states and outputs respectively. We
will show that $\mathcal{M}$ is the same as model $\mathcal{M}'$ specified in a different basis for the state
vector space of the model. First of all, suppose $U$ is a vector space, and $S$ is a
non-singular transformation matrix describing a change of basis for $U$. Then the change
of basis is described by the following transformations:

1. Every $x \in U$ is transformed to $Sx$.

2. Every linear operator $O$ which maps $U$ into $U$ is transformed to $SOS^{-1}$.

3. Every linear operator $P$ mapping $U$ into any other vector space is
transformed to $PS^{-1}$.

Now let $S = C'$ and let $S^{-1} = C$. Equation 3.43 tells us that the priors on $\mathcal{M}$ are
mapped onto the priors on $\mathcal{M}'$ by $S$ (i.e., $p' = Sp$). We have already observed that
$B' = BS^{-1}$. Next, consider the equation $BT(y)T(x)S^{-1} = B'T'(y)T'(x)$. Substituting
for $BT(y)$ we find that for every $y \in O^* \cup \{\epsilon\}$ we can write

\begin{align}
B'\,T'(y)\,S\,T(x)\,S^{-1} &= B'\,T'(y)\,T'(x) \tag{3.45} \\
\Rightarrow \quad B'\,T'(y)\,[\,S\,T(x)\,S^{-1} - T'(x)\,] &= 0 \tag{3.46}
\end{align}

This implies that the corresponding columns of $ST(x)S^{-1}$ and $T'(x)$, when appropriately
normalized to sum to 1, would be equivalent priors for $\mathcal{M}'$. So, by Theorem 3.3
they are equal to each other and we can write:

\begin{equation}
T'(x) = S\,T(x)\,S^{-1} \tag{3.47}
\end{equation}

From this we also know that $\Sigma'(x) = B'T'(x) = BS^{-1}ST(x)S^{-1} = BT(x)S^{-1} =
\Sigma(x)S^{-1}$. Summarizing our conclusions we find that:

\begin{align}
p' &= S\,p \tag{3.48} \\
B' &= B\,S^{-1} \tag{3.49} \\
T'(x) &= S\,T(x)\,S^{-1} \tag{3.50} \\
\Sigma'(x) &= \Sigma(x)\,S^{-1} \tag{3.51}
\end{align}

These equations describe transformations that are formally identical to a basis
transformation represented by the matrix $S$. Furthermore, every quantity used to prove
the theorems of this thesis consisted of sums and products of the quantities in
Equations 3.48 to 3.51. So we conclude that equivalent minimal representations are related
by a basis transformation for the state vector space of the models. □
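
As a quick consistency check on the four transformation rules summarized in the proof (a worked restatement added for illustration, not a step of the original argument), the transformed quantities can be multiplied out directly. Assuming, as elsewhere in this chapter, that the output statistics of a model with prior $p$ are read off from the vectors $\Sigma(x)p$, the factors of $S$ cancel in pairs and the two models assign the same values:

```latex
\begin{align*}
\Sigma'(x)\,p' \;=\; B'\,T'(x)\,p'
  \;=\; \bigl(B S^{-1}\bigr)\bigl(S\,T(x)\,S^{-1}\bigr)\bigl(S p\bigr)
  \;=\; B\,T(x)\,p \;=\; \Sigma(x)\,p .
\end{align*}
```

The same cancellation occurs in any product of these quantities, which is why the change of basis leaves every modelled process unchanged.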

Theorem 3.4 tells us that the minimal representation obtained in Theorem 3.2 is es- 
sentially unique, up to a change of basis for the state vector space. So we have indeed 
achieved a satisfactory characterization of the degree of expressiveness in a GMM and 
obtained a minimal, canonical representation for the equivalence classes of GMMs. 
We will now describe an algorithm that will canonicalize a model by reducing it to 
its minimal, canonical representation. 



Reduction Algorithm: In order to construct an algorithm to canonicalize GMMs
we will follow the proof of Theorem 3.2. In order to reduce a model $\mathcal{M}$ to its minimal
equivalent form, we need to generate a basis for the span of $U = \{\sigma(x) : x \in O^*\}$.
Using the methods developed in our very first algorithm to check equivalence of prior
distributions, we can generate such a basis in $O(n^3 k)$ time, where $n$ is the number of
states and $k$ is the number of outputs. Then we use standard Gaussian elimination to
find the linear dependencies amongst the $g_i$ vectors defined in Equation 3.27. This will
take time $O(n^3 + n^2 k)$.[press90] The proof of Theorem 3.2 shows that the coefficients
of these linear dependencies are the $\{f_{ji}\}$ required by Lemma 3.2 to reduce the model.
The reduction procedure takes time $O(n^2 + nk)$ since we simply have to set the (at
most) $O(n^2 + nk)$ parameters of the reduced model according to the rules specified
in Lemma 3.2. Therefore, for large $k$ and $n$ we can reduce a GMM to its minimal,
canonical representation in worst-case time $O(n^3 k)$.
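
The basis-generation step of the algorithm can be sketched as follows. This is a minimal illustration under assumptions, not the implementation analysed above: it takes the output matrix B and the transition operators T(o_k) as plain lists of rows, uses the relation from Equation 3.34 that prepending an output o_k to a string multiplies each suffix-matrix row on the right by T(o_k), and makes no attempt to match the stated running time. The function names, the example model, and the convention used to build the example operators are all hypothetical.

```python
from fractions import Fraction


def row_times_matrix(v, M):
    """Return the row vector v multiplied on the right by the matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def reduce_against(v, echelon):
    """Subtract from v its components along the stored echelon rows."""
    v = list(v)
    for pivot, row in echelon:
        if v[pivot] != 0:
            c = v[pivot] / row[pivot]
            v = [a - c * b for a, b in zip(v, row)]
    return v


def suffix_row_basis(B, T_ops):
    """Collect linearly independent rows sigma(x) of the suffix matrices B T(x)."""
    echelon = []  # (pivot index, reduced row) pairs used for the dependence test
    basis = []    # the independent sigma(x) vectors themselves
    pending = [[Fraction(x) for x in row] for row in B]  # rows for the empty string
    while pending:
        v = pending.pop()
        r = reduce_against(v, echelon)
        pivot = next((i for i, x in enumerate(r) if x != 0), None)
        if pivot is None:
            continue  # v already lies in the span found so far
        echelon.append((pivot, r))
        basis.append(v)
        # Prepending an output o_k multiplies the row on the right by T(o_k).
        pending.extend(row_times_matrix(v, T) for T in T_ops)
    return basis


# Hypothetical 3-state, 2-output model in which states 0 and 1 behave identically.
half, third = Fraction(1, 2), Fraction(1, 3)
B = [[half, half, Fraction(1, 4)],
     [half, half, Fraction(3, 4)]]
A = [[third, third, half],
     [third, third, 0],
     [third, third, half]]
# Assumed convention for this example only: T(o)[j][i] = A[j][i] * B[o][i].
T_ops = [[[A[j][i] * B[o][i] for i in range(3)] for j in range(3)] for o in range(2)]
canonical_dimension = len(suffix_row_basis(B, T_ops))
```

Each vector added to the basis raises the dimension of the span by one, so at most n vectors are ever expanded and the loop terminates. The rows returned play the role of the rows of G in Equation 3.27, and their number is the canonical dimension (2 for this example, which has 3 states and is therefore reducible).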

3.2.1 Results for HMMs 

Hidden Markov Models are derived from the subclass of GMMs with stochastic transition
matrices by restricting the priors to also be stochastic. This restriction on the
priors makes it a little difficult to compare HMMs directly to GMMs. However, we
can make good progress by saying that a GMM $\mathcal{M}$ contains an HMM $\mathcal{N}$ if for every
stochastic prior $p$ on $\mathcal{N}$, we can find an equivalent pseudo-stochastic prior $q$ on $\mathcal{M}$.
In other words, $\mathcal{M}$ contains $\mathcal{N}$ if every process that can be modelled by HMM
$\mathcal{N}$ can also be modelled by GMM $\mathcal{M}$. Now let $\mathcal{N}_G$ denote the GMM derived by
removing the stochasticity restriction on the priors on $\mathcal{N}$. Clearly, if $\mathcal{N}_G \subseteq \mathcal{M}$ then
$\mathcal{M}$ contains $\mathcal{N}$, since $\mathcal{N}_G$ can model every process modelled by $\mathcal{N}$. By definition of
containment, it is also clear that if $\mathcal{M}$ contains $\mathcal{N}$, then all the stochastic priors on
$\mathcal{N}_G$ can be mapped to equivalent priors on $\mathcal{M}$. But, by Lemma 3.1 this means that
$\mathcal{N}_G \subseteq \mathcal{M}$. So we see that GMM $\mathcal{M}$ contains an HMM $\mathcal{N}$ if and only if $\mathcal{N}_G \subseteq \mathcal{M}$,
where $\mathcal{N}_G$ is the GMM derived by removing the stochasticity restriction on the priors
on $\mathcal{N}$. We can use this to state the following theorem.

Theorem 3.5 Minimal Representations of HMMs 

Suppose $\mathcal{N}$ is an HMM and $\mathcal{N}^*$ is the smallest HMM equivalent to $\mathcal{N}$. Let $\mathcal{N}_G$ and
$\mathcal{N}^*_G$ denote the GMMs derived by removing the stochasticity constraints on $\mathcal{N}$ and
$\mathcal{N}^*$ respectively. Then every GMM $\mathcal{M}$ that contains $\mathcal{N}$ must satisfy $\mathcal{N}_G \subseteq \mathcal{M}$.
Furthermore, the minimal HMM $\mathcal{N}^*$ has at least as many states as the smallest GMM
equivalent to $\mathcal{N}_G$.

Proof: First of all, suppose that $\mathcal{M}$ is a GMM that contains $\mathcal{N}$. Then we know
that the stochastic priors on $\mathcal{N}_G$ can be transformed into equivalent priors on $\mathcal{M}$. By
Lemma 3.1 we can then conclude that $\mathcal{N}_G \subseteq \mathcal{M}$. Next, by the assumption that $\mathcal{N} \Leftrightarrow \mathcal{N}^*$,
the stochastic priors on $\mathcal{N}_G$ and $\mathcal{N}^*_G$ can be transformed onto equivalent priors on
each other. Therefore, Lemma 3.1 tells us that $\mathcal{N}_G \Leftrightarrow \mathcal{N}^*_G$ also. Now let $\mathcal{M}^*$ be
the smallest GMM equivalent to $\mathcal{N}_G$. Then, since $\mathcal{M}^* \Leftrightarrow \mathcal{N}^*_G$ and $\mathcal{M}^*$ is a minimal
model, we can conclude that $\mathcal{N}^*_G$ has at least as many states as $\mathcal{M}^*$. □

As a corollary of this theorem we can show that the smallest GMM containing a given
HMM $\mathcal{N}$ is the minimal representation for $\mathcal{N}_G$. This is because we have shown that
every GMM $\mathcal{M}$ containing $\mathcal{N}$ must satisfy $\mathcal{N}_G \subseteq \mathcal{M}$. It is easy to show that this
implies that the canonical dimension of $\mathcal{M}$ must be at least as large as that of $\mathcal{N}_G$.
It can also be shown that if $\mathcal{A}$ and $\mathcal{B}$ are GMMs with the same canonical dimension
and $\mathcal{A} \subseteq \mathcal{B}$, then $\mathcal{A} \Leftrightarrow \mathcal{B}$. Putting these facts together we can see that the minimal
representation of $\mathcal{N}_G$ is the smallest GMM we could possibly pick to contain $\mathcal{N}$.

Theorem 3.5 showed that the minimal HMM representation of a class of processes 
will be at least as big as the minimal GMM containing that class. We can also show 
that if we insist on having a stochastic interpretation of the parameters of a model, 
we may sometimes need many more states than the minimal GMM can achieve. We 
can see this as follows. Notice that the space of distributions on outputs spanned by 
a k-output HMM defines a convex polyhedron on the k — 1 dimensional probability 
simplex. The vertices of the polyhedron are defined by the convex hull of the output 
distributions on the states. By choosing the priors on the model appropriately we can 
explore every corner of the polyhedron. In the worst case, the output distributions 
of every state may fall on the convex hull, and so it would be impossible to build 
a smaller stochastic model of them. However, if we permit ourselves to use general 
linear combinations, we may find that many of the output distributions are linear 
combinations of each other, which leads to potential reducibility. This shows that if 
our goal is to find parsimonious and easily manipulable representations for stochastic
processes, using GMMs would appear to be a very reasonable course of action.

If we insist on using models with stochastic parameters, it is possible to define a 
stochastic canonical dimension of an HMM. This quantity would represent the number 
of "basis vectors" we would need if we only used convex combinations in all the places 
where we currently use general linear combinations. Analysis of this definition is more 
difficult since the "basis vectors" for convex combinations correspond to vertices of
convex polyhedra and the wealth of results concerning bases for linear vector spaces 
is not available. However, a brief consideration of the problem suggests that it is very 
likely that an HMM can be reduced to a minimal stochastic representation with only 
as many states as its stochastic canonical dimension.

We have now concluded the major portion of this thesis. The next chapter will 
discuss further directions of research and point out several questions that were not 
sufficiently investigated in this work. 






Chapter 4 



Further Directions and Conclusions 



In Chapter 3 we defined Generalized Markov Models, a new class of finite-state repre- 
sentations for stochastic processes, and saw how the results on equivalence of HMMs 
could be extended to GMMs. We used this to define the canonical dimension of a 
GMM and developed a complete characterization of the minimal, canonical represen- 
tations for these models. We also saw how HMMs are related to GMMs, and observed 
that a minimal representation for a stochastic process in the HMM formalism neces- 
sarily has at least as many states as the minimal representation in the GMM model. 
One issue that was not thoroughly investigated in this thesis involves characterizing 
the class of valid priors on a GMM. Since the definition of GMMs was not construc- 
tive, it is not obvious what the space of valid priors on a model looks like. Hence, 
we do not have a characterization of the class of valid transition matrices for GMMs. 
One way of trying to understand this issue is to apply the well-worn vector space 
techniques of this thesis once again, this time to the task of determining whether a 
given pseudo-prior is valid for a model. Similarly, we could determine whether a given 
transition matrix is allowable. In addition to the problem of characterizing the valid 
priors on a model, there are several other important issues that were not considered 
in this thesis. We will discuss these in the sections below.

4.1 Reduction of Initialized HMMs

In Chapter 3 we addressed the problem of reducing GMMs to minimal canonical 
forms. We saw that a class of processes with canonical dimension d needed at least d 
states in its GMM representation. In many applications, after the stage of training 
parameters for a model is completed, we will not actually need the freedom of being 
able to set different prior distributions on the model. In other words, we will actually 
be dealing with an Initialized GMM. Since the model now represents a single process, 
it may be possible to reduce the number of states still further.¹ So we should consider
how to reduce an Initialized GMM $(\mathcal{M}, p)$ to a minimal representation $(\mathcal{N}, q)$ such
that $\mathcal{N}(q) \Leftrightarrow \mathcal{M}(p)$ and $\mathcal{N}$ has as few states as possible.

4.2 Reduction While Preserving Paths 

In some pattern recognition applications of Hidden Markov Models the maximum 
likelihood path producing an output sequence x is as important as the probability 
that x is produced. In such cases, we will be faced with two new issues that were not 
addressed in this thesis. First of all, we will have to give meaning to a "maximum 
likelihood path" in a Generalized Markov Model. Secondly, we will have to find a 
method of model reduction that preserves enough information about paths to recover
the identity of paths in the original model from paths in the reduced model. There 
are some applications in which we are only interested in passage through some small 
number of states rather than the entire path. In such situations, the simplest way of 
achieving reduction while preserving paths would be to declare the appropriate states 
to be irreducible. Such states would never be merged with others in the reduction 
algorithm and so their identity would be preserved in the reduced model. 



¹ For example, suppose a GMM has one state that loops on itself with probability 1, and we
initialize the model with all the mass on the looping state. Then, once we have fixed this prior, all
the other states are clearly unnecessary.

4.3 Training GMMs

When we defined Generalized Markov Models in Chapter 3 we made no mention of
training algorithms for these models. This was partly because the class of valid transi- 
tion matrices and priors was not characterized, and this makes it difficult to evaluate 
whether a given set of parameters induces valid stochastic processes. Nonetheless, 
there are some options that come to mind immediately. First of all, we could use 
corrective training methods, such as gradient descent to minimize a squared error
measure. (Niles [niles90] suggests such a procedure in the context of his HMM-net.)
Furthermore, despite their exotic underlying chains, GMMs still define true probability
distributions on their output sequences. Consequently, it still makes sense to think
about Maximum Likelihood methods where we would attempt to set the parameters 
of the model to maximize the likelihood of a database of examples. The easiest way 
to derive a method for updating the parameters would be to follow the derivation 
of Levinson et al., who treat Maximum Likelihood Estimation in the framework of 
classical constrained optimization. [levinson83] 

4.4 Approximate Equivalence 

Although the results of the previous chapter are a complete characterisation of equiv- 
alence and reduction of GMMs, they can be a little unsatisfying, as the following 
example shows. Suppose $\mathcal{M}$ is a model whose transition amplitudes are all equal and
all of whose output distributions are linearly independent of each other. According to
our results, this model is not reducible because there is no degeneracy in the output
distributions. Indeed, it is true that we cannot build a smaller model that agrees with
$\mathcal{M}$ at all times. This is because it is always possible to pick priors in such a way as to
explore the entire valid span of the output distributions of $\mathcal{M}$, while a smaller model
could not span a space of the same dimension. Yet, it is clear that after the first
output, the distribution over states will be uniform and the probability of emitting
various outputs will always be unchanged. We might like to ignore the first output,
and say that $\mathcal{M}$ can be reduced to a single state, since what happens in the beginning
is an artifact of the prior. Minor modifications to our results would accommodate this:
whenever a theorem evaluated a condition for every $x \in O^* \cup \{\epsilon\}$, we would instead
evaluate the condition for $\{x : |x| > 1\}$. We would also need to appropriately modify
the vector spaces whose properties were checked in various algorithms developed in 
this thesis. 

However, this brings up the more general question of approximation algorithms. 
Often, we may not care about what happens at early or late times. Or we may not 
care if the statistics defined by two models are exactly the same so long as they are 
close. Approximate equivalence in this sense of "closeness" of models is particularly 
important because the parameters of probabilistic models are usually estimated from 
data. Consequently, exact equivalence will be a rare event. Equivalence ignoring late 
or early times can be easily handled within our methods by various slight modifica- 
tions of our results. The interesting and difficult problem is to define "closeness" of 
stochastic processes appropriately and to prove under what conditions the two GMMs 
are "close" under the definition. 



4.5 Practical Applications 

Finally, we should mention the possible practical applications of this work, particu- 
larly since it was originally begun in the context of building better practical classifiers. 
Statistical methods and models are being increasingly used in pattern recognition and
other fields. The models built in some applications can be very large ([kupiec90]) and 
reducing them to equivalent models of smaller sizes would be computationally useful. 
However, since the parameters of models are typically estimated from data, they will 
very rarely be exactly reducible and the approximation algorithms mentioned in the 
previous section will be crucial. Since we do not currently have a provably good
algorithm for approximate reduction, a reasonable preliminary course to take would be
to substitute tests for linear dependence with tests for "almost" linear dependence in
all the algorithms and results of the previous chapters. Of course, it is also possible
to simply build a smaller model and retrain it from data rather than reducing
a larger model. However, if the model would take a long time to train (e.g., if the 
database of examples is very large), or if the large model was constructed for human 
readability and manual fine tuning ([kupiec90]), reduction of a large model would be 
a better course of action. Even if we prefer to retrain a smaller model, the canonical 
dimension defined in Chapter 3 could be evaluated as a way of testing whether a 
smaller model should be built and retrained. The reduction algorithm could also be 
used as a way of finding the structure of a good smaller model that is equivalent or 
nearly equivalent to the original. Even if we retrain the parameters of the reduced 
model, the reduction step would tell us how many states we are likely to need to get 
a good representation of the statistics modelled by the larger model. 
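
One way the suggested substitution of "almost" linear dependence for exact linear dependence might look is sketched below. This is purely an assumption made for illustration: the thesis does not specify such a test, and the choice of Gram-Schmidt projection, the relative tolerance `tol`, and all function names are hypothetical. No claim is made that models reduced this way are close to the original in any particular sense.

```python
import math


def residual_after_projection(v, orthonormal_basis):
    """Remove from v its components along each orthonormal basis vector."""
    r = list(v)
    for q in orthonormal_basis:
        coeff = sum(a * b for a, b in zip(r, q))
        r = [a - coeff * b for a, b in zip(r, q)]
    return r


def norm(v):
    """Euclidean length of v."""
    return math.sqrt(sum(a * a for a in v))


def almost_dependent(v, orthonormal_basis, tol=1e-6):
    """True if v lies within relative distance tol of the span of the basis."""
    size = norm(v)
    if size == 0.0:
        return True
    return norm(residual_after_projection(v, orthonormal_basis)) <= tol * size


def extend_basis(v, orthonormal_basis, tol=1e-6):
    """Append the normalized residual of v unless v is almost dependent."""
    if almost_dependent(v, orthonormal_basis, tol):
        return False
    r = residual_after_projection(v, orthonormal_basis)
    length = norm(r)
    orthonormal_basis.append([a / length for a in r])
    return True


basis = []
extend_basis([1.0, 0.0, 0.0], basis)
extend_basis([0.0, 2.0, 0.0], basis)
nearly_in_span = almost_dependent([1.0, 2.0, 1e-9], basis)
clearly_outside = almost_dependent([1.0, 2.0, 0.5], basis)
```

With such a test, a vector estimated from data that misses the current span only by a small amount would be treated as dependent. The number of vectors accepted then depends on `tol`, which would have to be chosen empirically.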

Another potential practical application involves the implementation and evalua- 
tion of GMMs as pattern classifiers. There is some reason to suspect that given a 
GMM and an HMM with n states each, the GMM could perform better as a pattern 
classifier. This is plausible because, given a fixed number of states, a GMM can model 
a wider class of processes than an HMM. In practical applications we are typically 
dealing with the problem of approximating stochastic sequences. There may be pro- 
cesses modelled by n-state GMMs that are much closer to the true process than the 
best approximation we can find in the HMM formalism. In order to understand this 
question from the theoretical viewpoint we would need to make progress along several 
fronts including understanding the approximation properties of HMMs. For example, 
we would need to be able to compare how accurately a given stationary process can 
be represented by HMMs and GMMs with n states each. This is a difficult problem 
worthy of being studied. From the point of view of practical applications, the question 
of the usefulness of GMMs is best resolved empirically in the domain of application.

4.6 Conclusion

This thesis arose from an attempt to build part of a good foundation for pattern 
recognition using Hidden Markov Models. There is a need for analytical tools that 
will enable us to compare different formalisms for pattern recognition and
to predict their relative effectiveness. In this thesis, we have proved several theorems
that uncover the source of the intrinsic expressiveness of Hidden Markov Models. We 
have shown how to detect equivalence of prior distributions on a model and given 
a geometric characterization of equivalent priors. This led to a characterization of 
equivalent Initialized Hidden Markov Models and then of equivalent HMMs. We have 
given theorems that detect these equivalencies in polynomial time. Next, empirical 
and theoretical motivations led us to define the class of Generalized Markov Models 
which contain HMMs as a subclass. We used the definition to reduce HMMs and 
GMMs to minimal, canonical representations which remove all redundancy from a 
model. We also developed a geometric characterization of the minimal representations 
that gave insight into the source of the expressiveness of GMMs and HMMs. This 
characterization also led to a polynomial time reduction algorithm for Generalized 
Markov Models. These results lay part of a foundation for the principled use of 
finite-state models of stochastic processes in pattern recognition.

HMMs and Probabilistic Automata



There have been some recent results in the theory of Probabilistic Automata (PAs) 
that use methods very similar to ours to decide the equivalence of PAs in polyno- 
mial time.[tzeng] Tzeng's work also discusses a result on approximate equivalence of 
PAs that may provide leads on ways to proceed towards understanding approximate 
equivalence of HMMs and GMMs. In this appendix we will show how HMMs and PAs 
are related. First of all, we will define Probabilistic Automata in Tzeng's formulation. 

Definition A.1 Probabilistic Automata

Let $\mathcal{M}(i,j)$ denote the set of all $i \times j$ stochastic matrices. A Probabilistic Automaton
$U$ is a 5-tuple $(S, \Sigma, M, p, F)$, where $S = \{s_1, s_2, \ldots, s_n\}$ is a finite set of states, $\Sigma$ is
an input alphabet, $M$ is a function from $\Sigma$ into $\mathcal{M}(n,n)$, $p$ is a prior distribution on
the states, and $F \subseteq S$ is a finite set of final states. $M(a)_{ji}$ is the probability that
$U$ moves from state $s_i$ to $s_j$ after reading the symbol $a \in \Sigma$. We say that $x \in \Sigma^*$ (footnote 1)
is accepted by $U$ with probability $p_U(x)$ if $U$ ends up in a final state with probability
$p_U(x)$ on reading $x$.
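
The acceptance probability in this definition can be computed by pushing the prior through one matrix per input symbol and then summing the mass on the final states. The following is a minimal sketch of that computation, not part of Tzeng's formulation: the function names, the two-state automaton and its numbers are hypothetical, and matrices are assumed to be stored as lists of rows with `M[a][j][i]` holding $M(a)_{ji}$.

```python
def step(matrix, v):
    """Return the matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ji * v_i for m_ji, v_i in zip(row, v)) for row in matrix]


def acceptance_probability(M, prior, final, x):
    """Probability that the automaton is in a final state after reading x.

    M     : dict mapping each symbol a to an n x n column-stochastic matrix,
            where M[a][j][i] is the probability of moving from s_i to s_j
    prior : stochastic vector over the n states
    final : set of indices of the final states F
    x     : sequence of input symbols
    """
    v = list(prior)
    for a in x:
        v = step(M[a], v)  # v_j <- sum_i M(a)_{ji} v_i
    return sum(v[j] for j in final)


# Hypothetical two-state automaton over the alphabet {"a", "b"}.
M = {
    "a": [[0.5, 0.0],
          [0.5, 1.0]],
    "b": [[1.0, 0.3],
          [0.0, 0.7]],
}
prior = [1.0, 0.0]
final = {1}  # F = {s_2}

p_a = acceptance_probability(M, prior, final, "a")
p_ab = acceptance_probability(M, prior, final, "ab")
```

Note that the empty string is accepted here with probability 0, since the prior puts no mass on the final state; nothing in the definition forces a PA to accept the empty string with probability 1.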

It is clear from this definition that PAs are closely related to HMMs. The matrices
$M(a)_{ji}$ are similar in meaning to HMM transition operators and, barring the
complication of the final states, the rest of the model is similar also. We will show
that with an appropriate definition of equivalence, Initialized HMMs are a subclass of
Probabilistic Automata.

1. We are using the standard notation that $\Sigma^*$ is the set of all finite-length strings
that can be produced by concatenating symbols in $\Sigma$.

Definition A.2 Equivalence of PAs and HMMs

Let $U$ be a Probabilistic Automaton and let $(\mathcal{M}, p)$ be a Hidden Markov Model with
$\mathcal{O} = \Sigma$, where $\mathcal{O}$ is the output set of $\mathcal{M}$ and $\Sigma$ is the input alphabet of $U$. We will
say that $U$ and $\mathcal{M}$ are equivalent ($U \Leftrightarrow (\mathcal{M}, p)$) if $\Pr(x \mid \mathcal{M}, p) = p_U(x)$ for every
$x \in \mathcal{O}^* \cup \{\epsilon\}$.

Definition A.2 says that a Probabilistic Automaton $U$ is equivalent to a Hidden
Markov Model $\mathcal{M}$ when $U$ accepts strings with the same probability that $\mathcal{M}$ emits
them. We will now show that under this definition of equivalence, HMMs are
contained in the class of Probabilistic Automata.

Theorem A.1 HMMs $\subset$ PAs
The class of Hidden Markov Models is contained in the class of Probabilistic Automata
when equivalence is defined by Definition A.2.

The basic idea of the proof is shown in Figure A-1. Given an HMM $\mathcal{M}$, we will build
a Probabilistic Automaton $U$ with the same states plus one extra state to leak away
excess probabilities. We will then set up the matrices $M(a)_{ji}$ to mimic the action
of the HMM transition operators on the states that the two models share. If every
one of the shared states is a final state of $U$, this will guarantee that $\mathcal{M}$ and $U$ are
equivalent.

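Proof: Suppose $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ is any HMM with prior $p$. Construct a PA
$U = (S_U, \Sigma, M, p', F)$ where:

$$S_U = \mathcal{S} \cup \{s_U\} \tag{A.1}$$

$$\Sigma = \mathcal{O} \tag{A.2}$$

$$F = \mathcal{S} \tag{A.3}$$

$$p'_i = \begin{cases} p_i & \text{if } s_i \in \mathcal{S} \\ 0 & \text{otherwise} \end{cases} \tag{A.4}$$

$$M(a)_{ij} = \begin{cases}
B_{ai} A_{ij} & \text{if } s_i, s_j \in \mathcal{S} \\
1 - \sum_{s_k \in \mathcal{S}} M(a)_{kj} & \text{if } s_j \in \mathcal{S},\ s_i = s_U \\
0 & \text{if } s_j = s_U,\ s_i \in \mathcal{S} \\
1 & \text{if } s_i = s_j = s_U
\end{cases} \tag{A.5}$$
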
We will prove by induction on the length of strings $x$ that $p_U(x) = \Pr(x \mid \mathcal{M}, p)$
for every string $x$. First of all, both models accept the null string with probability 1,
since both start with all the mass of a stochastic vector on the states $\mathcal{S}$. Furthermore,
$\Pr(s_i, \epsilon \mid U) = \Pr(s_i, \epsilon \mid \mathcal{M}, p)$. Now suppose that for every string $x$ such that $|x| < t$,
it is true that $p_U(x) = \Pr(x \mid \mathcal{M}, p)$ and that for every state $s_i \in \mathcal{S}$, $\Pr(s_i, x \mid U) =
\Pr(s_i, x \mid \mathcal{M}, p)$. Then for any symbol $a \in \Sigma = \mathcal{O}$ and every state $s_j \in \mathcal{S}$, it will be true
that:

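$$\begin{aligned}
\Pr(s_j, xa \mid U) &= \sum_{i=1}^{|\mathcal{S}|} \bigl( B_{aj} A_{ji} \Pr(s_i, x \mid U) \bigr) + \Pr(s_U, x \mid U)\, M(a)_{jU} && \text{(A.6)} \\
&= \sum_{i=1}^{|\mathcal{S}|} B_{aj} A_{ji} \Pr(s_i, x \mid \mathcal{M}, p) && \text{(A.7)} \\
&= \Pr(s_j, xa \mid \mathcal{M}, p) && \text{(A.8)}
\end{aligned}$$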



Here the second term of (A.6) vanishes because no probability flows out of $s_U$ into
the states of $\mathcal{S}$, and (A.7) follows from the inductive hypothesis. Furthermore, since
the accepting states of $U$ are exactly the states in $\mathcal{S}$, it also follows that
$p_U(xa) = \sum_{s_j \in \mathcal{S}} \Pr(s_j, xa \mid U) = \Pr(xa \mid \mathcal{M}, p)$. By induction on $t = |x|$,
we can conclude that for every string $x$, the probability that $x$ is produced by $\mathcal{M}$ is
equal to the probability that $x$ is accepted by $U$. Hence, $(\mathcal{M}, p) \Leftrightarrow U$ and, therefore,
HMMs $\subseteq$ PAs.

On the other hand, it is easy to show that there are Probabilistic Automata that
cannot be implemented as Hidden Markov Models. For example, define the support
of a PA to be the set of strings that are accepted with non-zero probability. The
support of an HMM would be the set of strings that are emitted with non-zero
probability. Because of the definitions of the models, a PA could have a finite
support, or even a support that consists only of strings longer than a fixed length.
Neither of these two cases is possible for an HMM. Furthermore, the various $M(a)$
matrices of a PA need not bear any relationship to each other, while the corresponding
transition operators of HMMs are closely related to each other via the $A$ and
$B$ matrices. For this reason also it is possible to define PAs that are not equivalent
to any Hidden Markov Model in our formulation of equivalence. So we conclude that
HMMs $\subset$ PAs. □
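
The construction used in the proof can be checked numerically on a small example. The sketch below is illustrative only and rests on stated assumptions: $A$ is column-stochastic with `A[i][j]` the probability of moving from $s_j$ to $s_i$, `B[a][i]` is the probability of emitting `a` in $s_i$, and the shared-state block of $M(a)$ is taken to be $B_{ai} A_{ij}$. The function names and the two-state HMM are hypothetical.

```python
from itertools import product


def hmm_to_pa(A, B, prior):
    """Build (M, pa_prior, final) for a PA with one extra absorbing state s_U."""
    n = len(A)
    M = {}
    for a, b_a in B.items():
        # Shared states: M(a)[i][j] = B[a][i] * A[i][j]. The appended last
        # column (transitions out of s_U into shared states) is all zeros.
        rows = [[b_a[i] * A[i][j] for j in range(n)] + [0.0] for i in range(n)]
        # Row for s_U: the probability missing from each column leaks into
        # s_U, and s_U moves to itself with probability 1.
        leak = [1.0 - sum(rows[i][j] for i in range(n)) for j in range(n)] + [1.0]
        M[a] = rows + [leak]
    return M, list(prior) + [0.0], set(range(n))


def pa_accept(M, prior, final, x):
    """Probability that the PA ends in a final state after reading x."""
    v = list(prior)
    for a in x:
        v = [sum(m * u for m, u in zip(row, v)) for row in M[a]]
    return sum(v[j] for j in final)


def hmm_emit(A, B, prior, x):
    """Probability that the HMM emits x, under the indexing assumed above."""
    n = len(A)
    v = list(prior)
    for a in x:
        v = [B[a][i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v)


# Hypothetical two-state HMM over the outputs {"a", "b"}.
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = {"a": [0.7, 0.1], "b": [0.3, 0.9]}
p = [0.6, 0.4]
M, pa_prior, final = hmm_to_pa(A, B, p)

# Largest disagreement over every string of length 0 through 4.
max_gap = max(
    abs(pa_accept(M, pa_prior, final, x) - hmm_emit(A, B, p, x))
    for t in range(5)
    for x in product("ab", repeat=t)
)
```

For this example `max_gap` should be zero up to floating-point rounding. A finite check of this kind only illustrates the theorem on short strings; the inductive argument is what establishes it for all strings.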




Given an HMM M, we can construct a Probabilistic Automaton 
U that is equivalent to M by copying over the structure of M and 
adding one extra state to soak up excess probabilities. (See text 
for discussion.) 



Figure A-1: Constructing a Probabilistic Automaton Equivalent to an HMM



Theorem A.1 shows that HMMs can be considered a subclass of Probabilistic
Automata. However, if the number of outputs is large, an n-state PA will require
many more parameters to describe it than an n-state HMM. W. Tzeng proves results
concerning equivalence of Probabilistic Automata using methods that are similar to
ours.[tzeng] He also discusses the problem of approximate equivalence of PAs and
arrives at some interesting results. Since HMMs and PAs are so closely related, the
methods used by Tzeng to extract results concerning approximate equivalence can
guide us in studying the same question for HMMs.
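
The remark about parameter counts can be made concrete with a rough count of free parameters. The count below is a sketch under stated assumptions: $n$ states and $m$ symbols, every stochastic column of length $k$ contributing $k - 1$ free parameters, and the prior and the set of final states left out for both models.

```latex
\begin{align*}
\text{HMM:}\quad & \underbrace{n(n-1)}_{A} + \underbrace{n(m-1)}_{B} = n(n+m-2), \\
\text{PA:}\quad  & \underbrace{m\,n(n-1)}_{M(a),\ a \in \Sigma}.
\end{align*}
```

For example, with $n = 10$ and $m = 50$ this gives $580$ parameters for the HMM against $4500$ for the PA, so the gap grows roughly in proportion to the number of outputs.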